Introduction to Data Science
Section 6
Data Matters 2015
Sponsored by the Odum Institute, RENCI, and NCDS
Thomas M. Carsey
carsey@unc.edu

Data Science Examples in Action

Google Flu
• Link between search terms used in Google and actual flu outbreaks.
• The idea is that people with the flu or who are in areas where more folks have the flu will be searching more about it.
• Baseline model used searches on 45 terms related to flu and flu symptoms
  – http://www.google.org/flutrends/about/how.html

Billion Prices Project @ MIT
• Collects prices from hundreds of online retailers every day.
  – http://bpp.mit.edu/
• Measures health of the economy in near real time.
• Two days after Lehman Brothers failed in Sept. 2008, their index began to fall.
  – Govt. released data in mid-November on October figures before it could show similar evidence.

Medicare Fraud Detection
• 90 people charged with $260 million in false billing
  – https://www.fbi.gov/newyork/pressreleases/2014/medicare-fraud-strike-forcecharges-90-individuals-for-approximately-260million-in-false-billing

Twitter Trending
• Twitter trending
  – https://twitter.com/WhatsTrending

Nate Silver
• FiveThirtyEight
  – http://fivethirtyeight.com/

Computational Event Data System
• Automated coding of English-language news reports to generate political event data in the Middle East, Balkans, and West Africa.
• The focus is on contentious political events (bombings, shootings, protests, riots, etc.)
• Records initiator, target, location, “size”
  – http://eventdata.parusanalytics.com/

Satellite Data and Crop Insurance Fraud
• Mapping satellite images of fields (of 100 acres or larger in Nebraska)
• Map “greenness” of the field against crop insurance claims, then investigate claims on the ground.
• Good predictor of fraud
  – http://data2discovery.org/dev/wpcontent/uploads/2014/05/DMysnyk_Unsupervised_Learning_Classification.jpeg

Interest Groups and Policy Diffusion
Garrett and Jansa, forthcoming SPPQ

Plagiarizing Policy
Kroeger, State Politics Conference 2015
• Compares model bills with actual legislation
  – 64 groups; 1,434 model bills
  – 7,542,937 versions of bills
  – Results in 10,816,571,658 pairwise comparisons
• Measured similarity based on percentage of 5-word strings that appear in both model and actual bills.
  – https://www.dropbox.com/sh/i4ud5e6tyap1qp2/AACRSIANXFLC8U5Ln9Bn6usca/Kroeger_SPPC.pdf?dl=0

Communication Networks
Desmarais et al.
• Email from New Hanover County, North Carolina
  – 30 managers of departments of county govt.
  – Feb. 2011
• 30,909 emails
• 8,097 authored by managers
• 1,739 sent to other managers
• Study within vs. between group communication, broken down by topics
  – http://people.umass.edu/bruced/pubs/Krafft_Moore_Desmarais_Wallach_NIPS2012.pdf

Privacy, Ethics, Transparency

“Could” versus “Should”
• Advances in Data Science focus more on what could be done than what should be done.
• Current law has not caught up with current technology.
• Do people understand what they are giving away about themselves?
• Are researchers being open about what they are doing with data?
• Are companies?
• Are governments?

Data Access and Research Transparency (DA-RT)
• Increased regulations from government to provide access to data funded by government dollars.
• Increased professional publication standards for sharing research data.
• Increased pressure for transparency in research methods more generally, including data access.
• Symposium in Political Science (PS: Political Science and Politics, Vol. 47, Issue 01.)
  – http://journals.cambridge.org/action/displayIssue?jid=PSC&volumeId=47&seriesId=0&issueId=01

DA-RT Statement
• Commitment to DA-RT principles by academic journals.
  – Require authors to provide research data via a trusted repository
  – Require authors to provide clear instructions/code used for analysis
  – Maintain a consistent data citation policy
  – Update all codes, guidelines, etc. as needed.
  – http://media.wix.com/ugd/fa8393_da017d3fed824cf587932534c860ea25.pdf

Retracted Research
• Science article on changing minds on same-sex marriage
• Donald Green and Michael LaCour
• http://retractionwatch.com/2015/05/20/authorretracts-study-of-changing-minds-on-same-sexmarriage-after-colleague-admits-data-were-faked/
• Who is responsible?
  – Authors, publishers, advisors, IRBs, professional societies, others?

Growth in Retractions
• NY Times: Nature reports a 10-fold increase in retracted scientific research in the 2000s compared to the 1990s.
• http://www.nytimes.com/interactive/2015/05/28/science/retractions-scientific-studies.html?_r=0
• http://www.nature.com/news/2011/111005/pdf/478026a.pdf
• Best guess is that improved standards and technology are finding more problems, not that bad behavior itself is on the rise, but we don’t know.

Retracted Studies Live On
• Studies by John Budd at U. Missouri (School of Ed.) find that
  – Retracted articles continue to be cited just as often as non-retracted articles.
  – Only 4-8% of citations mention the retraction.
• There is an effort to deal with this called CrossMark
  – http://www.crossref.org/crossmark/
• More attention to retracted work at Retraction Watch
  – http://retractionwatch.com/

Other Retractions
• 1998 study linking vaccines to autism retracted in 2010
• http://www.nytimes.com/2010/02/03/health/research/03lancet.html
• 2014 article in Nature on stem cell production method
• http://www.nytimes.com/2014/07/03/business/stem-cellresearch-papers-are-retracted.html
• 2004 and 2005 papers in Science on human cloning
• http://www.nytimes.com/2005/12/31/science/amid-confusionjournal-retracts-koreans-stem-cell-paper.html

There is More
• John Darsee faked data in about 100 published papers on heart disease
• Study linking vaccines to autism
• Study touting a new method of producing stem cells
• Two studies on human cloning
• 17 papers on physics by J. Hendrik Schon

Privacy and Ethics
• Data, the elements of data science, and even so-called “Big Data” are not new.
• One thing that is new is the greater variety of data and, most importantly, the amount of data available about humans.
• Discussion and good policy regarding privacy, security, and the ethical use of data about people lag behind the methods of collecting, sharing, archiving, and analyzing data.

The Free Market, Unfair Competition, Big Brother?

Privacy and Ethics
• Several questions emerge from the explosion of digital data about us:
  – Identity: What is the relationship between our offline and online identities?
  – Privacy: Who should control access to online data?
  – Ownership: Who owns data? Can ownership rights be transferred? What are the obligations of those who generate and use data?
  – Reputation: How can we determine if data is trustworthy? How can we know if a provider acquired data legitimately?
• There are entire industries on reputation management.
Ethical Claims
• Davis claims “Big Data, like all technology, is ethically neutral”
• However, business is a social/public enterprise, so you will encounter the values others hold.
• Competing values mean competing ethical standards, and thus, ethical conflicts.
• Modern Big Data magnifies this encounter because it increases the points of connection.
• It also leaves a digital trail of those encounters.

Ethical Decision Points
• Inquiry: what are our values?
• Analysis: what are our current data handling practices? Do they align with our values?
• Articulation: explicit written expression of alignment and gaps between values and practices
• Action: specific plans and activities used to close those gaps.

Buying vs. Selling
• Company data policies:
  – 34 of 50 Fortune-class companies said they would not sell personal data.
  – No company said explicitly that they would sell personal data.
  – No company made any explicit statement that they would not buy personal data.
  – 11 companies said explicitly that buying personal data was allowed.
• Why different standards for buying vs. selling personal data?

Privacy
• Physical Privacy
• Informational Privacy
• Organizational Privacy
• Privacy of
  – Our communications
  – Our behavior
  – Our person
• Is digital communal living different than physical communal living?
• Who decides? Who regulates? Who enforces?

Privacy as a Common Pool Resource?
• We all benefit from privacy norms, but we all may realize personal gains by violating them.
• If everyone has that same incentive, the public good gets destroyed.
• To govern our collective behavior:
  – We need norms/rules that are agreed upon and known
  – We need effective monitoring
  – We need a mechanism to sanction violations
• The question is, do we need an external force or can we govern ourselves?
(Elinor Ostrom’s work)

Closing Thoughts on Data Science

Conclusions
• Data Science is an evolving field
  – Exciting, confusing, immature
• Data science will be critical in an information economy and to national security, but it is also changing our social behavior, the arts, and everything else.
• There are many claims made about data science and “Big Data,” and some of them are probably true.
• Focused on applied interaction between computer science, information science, and statistics.
  – This is good, but . . .

Conclusions (cont.)
• Data Science needs to figure out how to include substantive expertise and theories.
  – Finding patterns and gaining understanding are two different tasks.
• Data Science needs to foster interdisciplinary communication
  – Data as the universal language
• Data Science needs greater attention to privacy and ethics.
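
Worked sketch for the Plagiarizing Policy example: the slides say similarity was measured as the percentage of 5-word strings that appear in both a model bill and an actual bill, across 1,434 model bills and 7,542,937 bill versions. The Python below is only an illustration of that idea, not Kroeger's actual code. The tokenization rule, the choice of the model bill's 5-word strings as the denominator, the function names, and the two toy bill texts are all assumptions made here for the example.

```python
import re


def word_ngrams(text, n=5):
    """Return the set of n-word strings in text (lowercased, punctuation dropped).

    The tokenization rule is an assumption for this sketch.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngram_pct(model_bill, actual_bill, n=5):
    """Percentage of the model bill's n-word strings also found in the actual bill.

    Using the model bill as the denominator is an assumption; the slides
    do not say which denominator was used.
    """
    model = word_ngrams(model_bill, n)
    if not model:
        return 0.0
    actual = word_ngrams(actual_bill, n)
    return 100.0 * len(model & actual) / len(model)


# Hypothetical toy texts, invented for illustration only.
model_bill = (
    "No person shall be required to join a labor organization "
    "as a condition of employment"
)
actual_bill = (
    "In this state no person shall be required to join a labor union "
    "as a condition of employment"
)

pct = shared_ngram_pct(model_bill, actual_bill)
print(f"Shared 5-word strings: {pct:.1f}% of the model bill")

# The comparison count on the slide is every model bill paired with
# every bill version.
MODEL_BILLS = 1434
BILL_VERSIONS = 7_542_937
pairwise_comparisons = MODEL_BILLS * BILL_VERSIONS
print(f"Pairwise comparisons: {pairwise_comparisons:,}")
```

Storing each bill's 5-word strings as a set makes a single comparison a set intersection. At more than ten billion pairs, a real pipeline would presumably need to precompute and index those sets rather than re-tokenize each bill for every pair.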