Big data: What are you missing?  the risks of assuming data equals “all”  David J. Hand 

Big data: What are you missing? the risks of assuming data equals “all”
David J. Hand
Imperial College London and Winton Capital Management
6th January 2016
Theory of Big Data, UCL

BACKGROUND

The promise of big data:
McKinsey’s big data report: ‘we are on the cusp of a tremendous wave of innovation, productivity, and growth, as well as new modes of competition and value capture, all driven by big data as consumers, companies, and economic sectors exploit its potential’
and endless ditto by others

However, as I have argued elsewhere
1) big data is not the solution; it’s what you do with it that counts
2) big data carries risks

Two kinds of big data opportunities
1) Computer science: through data manipulation
merging, linking, matching, concatenating, sorting, basic arithmetic, ...
Database heritage: could conceivably have all the data
e.g. stock in the warehouse
e.g. employees in the firm
2) Statistics: through inference and predictive analytics
Many (most?) problems cannot have all the data
e.g. observations in clinical trials
e.g. forecasting
e.g. physics experiments

The challenges of big data
1) Computational and mathematical challenges
‐ large n and/or d
‐ speed of acquisition, realtime analysis
Hand’s Law: the requirements for increased computer power always increase faster than the increase in power itself
2) Inferential and statistical challenges
‐ complexity – networks, mixed data types, ...
3) Data challenges
‐ data quality
‐ non‐stationarity
‐ formulating the question
‐ correlation vs causation
‐ ...

THE AIM OF THIS PAPER:
To focus on one problem and show how it is pervasive in big data opportunities
‐ risking misleading conclusions
‐ incorrect understanding
‐ mistaken decisions
‐ wasted money
‐ ...
And to show what’s needed to tackle it

This is the problem of SELECTION BIAS

SOME EXAMPLES:

Example 1: Potholes
‐ Streetbump smartphone app
‐ Detects potholes using accelerometer and emails location to local authority using GPS
‐ “Big data”, but no sophisticated computation or analytics
‐ But lower income people less likely to have smartphones and cars, older people less likely to have smartphones, ...
→ streets in richer areas get fixed

Example 2: Hurricane Sandy
20 million tweets between 27 October and 1 November 2012
But a distorted impression of where problems are:
‐ most tweets came from Manhattan
‐ few from “more severely affected locations, such as Breezy Point, Coney Island and Rockaway”
‐ because of relative density of population/smartphones
‐ because power outages meant phones not recharged
→ distorted impression of where the damage occurred
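
The mechanism in Example 1 can be made concrete with a small calculation. This is only an illustrative sketch, not anything taken from the Streetbump app itself: the area names, pothole counts, traffic volumes and app-ownership shares are all hypothetical, and it assumes that each passing vehicle carries the app independently and that the app never misses a pothole.

```python
def prob_reported(passes: int, app_share: float) -> float:
    """Chance that a pothole is reported at least once.

    Assumption: each passing vehicle independently carries the app with
    probability app_share, and the app always detects the pothole.
    """
    return 1.0 - (1.0 - app_share) ** passes


# Hypothetical areas: identical true pothole counts and traffic,
# differing only in the share of drivers running the app.
areas = {
    "richer": {"potholes": 100, "passes": 200, "app_share": 0.05},
    "poorer": {"potholes": 100, "passes": 200, "app_share": 0.005},
}

# Expected number of potholes that reach the local authority in each area.
expected_reported = {
    name: a["potholes"] * prob_reported(a["passes"], a["app_share"])
    for name, a in areas.items()
}

for name, count in expected_reported.items():
    total = areas[name]["potholes"]
    print(f"{name}: about {count:.0f} of {total} potholes reported")
```

Under these made-up numbers nearly every pothole in the richer area is reported, but only around two thirds of those in the poorer area, even though the roads are equally bad. The data are "big", but the reporting mechanism has already selected which streets appear in them.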
Example 3: Retail finance scorecard construction
Aim: build model to decide which applicants should be given a loan
Data: characteristics and (default/repay) outcome of those granted loans in past
‐ but those granted loans in the past were selected on the basis of some previous scorecard
‐ they do not represent the entire population of applicants
Same structure for student selection, staff recruitment, ...

Example 4: Crime rates
Points to note:
1) The difference between the CSE&W and PRC
2) The dramatic fall in CSE&W from 1995

1) Crime Survey for England and Wales (CSE&W) versus Police Recorded Crime (PRC)
CSE&W: aged ≥ 16; children 10‐15; not group residences; not crimes against commercial or public sector bodies; victim‐based (does not include murder); not fraud and cyber; capping repeat victimisation; ...
PRC: reported to and recorded by police; crime defined by “Notifiable Offence List” (incl. murder, public order, ...); incl. residents of institutions and tourists; incl. commercial bodies
2) CSE&W: 19m in 1995 to 7m in year ending June 2015
Less crime, or shifting patterns of crime, e.g. to fraud, which is not measured by the CSE&W?

Plastic card fraud in the UK, 2004‐2014

Example 5: Publication bias
Relevant factors include:
‐ tendency not to submit negative results (file‐drawer effect)
‐ positive results are more interesting to editors
‐ anomalous results may be regarded as errors, and not submitted
In an exploration of publication bias in the Cochrane database of systematic reviews:
“In the meta‐analyses of efficacy, outcomes favoring treatment had on average a 27% ... higher probability to be included than other outcomes. In the meta‐analyses of safety, results showing no evidence of adverse effects were on average 78% ... more likely to be included than results demonstrating that adverse effects existed.”
Kicinski et al 2015

WHAT DRIVES SELECTION BIAS:

1) Natural mechanisms
Abraham Wald and the WWII bomber armour
The bullet holes in returning bombers showed where they could be hit without bringing them down
A lesson for business schools?
Look at the failures, not the successes
Francis Bacon: “when they showed him hanging in a temple a picture of those who had paid their vows as having escaped shipwreck, and would have him say whether he did not now acknowledge the power of the gods — ‘Aye,’ asked he again, ‘but where are they painted that were drowned after their vows?’”

2) Non‐response and refusals
LFS quarterly survey wave‐specific response rates: March‐May 2000 to July‐Sept 2015
http://www.ons.gov.uk/ons/guide-method/method-quality/specific/labour-market/labour-force-survey/index.html

3) Self‐selection
(i) The magazine survey which asks the one question: do you reply to magazine surveys?
(ii) The Literary Digest’s disastrous prediction that Landon would beat Roosevelt in the 1936 presidential election
Standard explanation: the prediction was based on polling people with phones, who are more likely to be Republican
But this is a myth
In fact 10m people were polled, but only 2.3m replied
A self‐selected sample, and in this election the anti‐Roosevelt voters felt more strongly than the pro‐Roosevelt ones
(iii) The Actuary edition of July 2006 included an editorial which said
‘A couple of months ago I invited you ‐ all 16,245 of you ‐ to participate in our online survey concerning the sex of actuarial offspring. ... Well, I’m pleased to say that a number of you (13, in fact) replied to our poll.’
Particularly web‐based surveys
‐ who replies?
‐ under‐representation of some groups
‐ multiple responding

4) Data dredging
Test enough (true null) hypotheses and you expect some to be significant by chance
This does not have to be dishonest: if 1000 teams each test one true null hypothesis at the 5% level ....
Charles Babbage termed such data dredging “cooking”: “make multitudes of observations, and out of these to select only those which agree, or very nearly agree.
If a hundred observations are made, the cook must be very unlucky if he cannot pick out fifteen or twenty which will do for serving up”
Robert Millikan, Gregor Mendel, ...

5) Harking
Hypothesising after the results are known
Presenting post‐hoc hypotheses as if they were a priori
Popperian science:
Step 1: data suggest theory
Step 2: theory is tested with new data
Step 3: loop through steps 1 and 2
Harking arises when the same data are used in Steps 1 and 2

6) Feedback and asymmetric information
(i) The market for lemons
The buyer of a used car, with no further information on the vehicle in question, offers the average price of such vehicles
The seller can keep the better quality ones and sell only the poor quality ones

(ii) Crimemaps
But people will not bother to report minor crime if they feel there’s no point or for other reasons
“More than 5.2 million people have not reported crimes for fear of deterring home buyers or renters since the online crime map was launched in February 2011”
“A quarter (24 per cent) of people would not report a crime for fear it would harm their chances of selling or renting their property”
http://www.directline.com/media/archive-2011/news-11072011
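
The feedback loop behind the market for lemons in 6(i) can be sketched as a short iteration. This is a toy illustration under stated assumptions, not a model from the talk: the function name `lemons_unravelling` and the car values are hypothetical, the buyer is assumed always to offer the mean value of the cars currently on sale, and a seller is assumed to withdraw any car worth more than that offer.

```python
def lemons_unravelling(qualities):
    """Return (offered prices per round, cars still on sale at the end).

    Assumptions: the buyer cannot see quality and offers the average
    value of the cars on sale; sellers withdraw cars worth more than
    the offer; the process repeats until nobody withdraws.
    """
    on_sale = sorted(qualities)
    prices = []
    while on_sale:
        price = sum(on_sale) / len(on_sale)  # buyer offers the average value
        prices.append(price)
        remaining = [q for q in on_sale if q <= price]  # better cars are kept back
        if len(remaining) == len(on_sale):
            break  # no seller withdraws, so the market has settled
        on_sale = remaining
    return prices, on_sale


# Hypothetical market: ten cars worth 1, 2, ..., 10 (in thousands).
prices, on_sale = lemons_unravelling(range(1, 11))
print(prices)   # offered price in each round
print(on_sale)  # cars left on the market at the end
```

Each round the observed sample of cars for sale is selected by the previous round's price, so the "average car on the market" keeps drifting away from the average car in the population.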
(iii) Evaluating new scorecards
Apply incumbent and challenger to a sample of customers
But this sample will have been accepted by the incumbent
→ data asymmetry
Standard scorecard performance measures favour the challenger

(iv) Credit card transaction fraud detection
Transaction stream terminated when incumbent detects a fraudulent transaction, not when the challenger does
→ data asymmetry
Standard fraud detection measures favour the incumbent

7) Gaming
Goodhart’s law: when a measure becomes a target, it ceases to be a good measure
“As soon as the Government attempts to regulate any particular set of financial assets, these become unreliable as indicators of economic trends”
Investors try to anticipate the effect of the regulation, and adapt to benefit from it

Campbell’s law: The more any quantitative social indicator is used for social decision‐making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor
‐ schools enter for public exams only those expected to excel
‐ ambulance response times
Bevan and Hamblin, 2009

8) The law
EU gender discrimination in insurance and credit
Credit scoring aims to “discriminate” between the good and bad credit risks
→ statistical models to estimate probability of repaying a loan
→ build the best model we can (benefits company and customer)
‐ include all variables which enhance predictive power
‐ models as sophisticated as we like

Practical and ethical problem:
US Equal Credit Opportunity Act of 1974 makes it illegal for creditors to discriminate against any applicant on the basis of race, colour, religion, national origin, sex, marital status, or age (similar in other countries)
Even though women are generally less risky than men
That is: it is illegal to treat differently people who belong to certain groups with known different degrees of risk

Disadvantaging females, who have to pay higher rates, and have their loan applications rejected more often
Advantaging males, who have to pay lower rates, and have their applications accepted more often
Contrast with insurance: where males and females could be charged different premiums
Until the European Court of Justice ruled in 2011 that the use of gender would not be permitted in determining prices and benefits from insurance from 21 December 2012

Now imagine
‐ If the cost of driving insurance is equalised at a weighted mean of the previous male and female values;
‐ then more of the higher risk category will be able to drive on our roads;
‐ increasing the risk to all of us
In fact, nearly all age groups saw a drop in premiums, except
‐ women aged 17‐20 saw a rise in their premiums
‐ men of the same age saw the biggest drop

9) Underpowered studies
“Studies in psychology are endemically underpowered” Bertamini and Munafo, 2012
The law of small numbers: The tendency to generalise from small samples
“the mistaken assumption that the law of large numbers applies to small numbers as well” Hand, The Improbability Principle, p194

10) Conditional probabilities
Regression to the mean
Every technology is
overhyped at its birth

WHAT TO DO ABOUT IT:

1) Construct and stick to sampling frame
Or use “gold samples”
Draw some cases from throughout the sample space
Then standardise

2) Registers
e.g. in surveys of people
e.g. pre‐registration in clinical trials
September 2004: NEJM, Lancet, Annals of Internal Medicine, JAMA: required drug research sponsored by pharmaceutical companies to be pre‐registered in a public database as a pre‐condition for publication

3) Detecting, e.g. publication bias
Caliper tests: ratio of reported results just above and just below the critical value associated with (e.g.) p = 0.05
Funnel plots (and tests derived from them) are based on the law of small numbers
‐ large studies are likely to be published regardless of results
‐ small studies are likely to be published only if the results are “interesting”, i.e. significant

A relationship between sample size and effect size is suspicious
Hence the overabundance of points in the bottom right of the funnel and the dearth in the left
Copas 1999

4) Model the selection mechanism
Heckman selection models (Nobel Prize)
Copas publication bias correction models

CONCLUSION: The danger of selection bias
“Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.”
Chris Anderson, Wired, in an article called ‘The end of theory: the data deluge makes scientific method obsolete’

The danger is that you don’t know it’s happening
The numbers lie for themselves

A final example
Anthropic bias:
The extraordinary coincidence that the universe has exactly the right characteristics for human life to evolve
The universe must be like it is or we wouldn’t be here to see it

thanks
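
The caliper test mentioned under 3) of “What to do about it” can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than any published implementation: the function name `caliper_test`, the window width of 0.2 and the list of z‐statistics are all hypothetical, and the comparison uses a one‐sided exact binomial test of the null hypothesis that a reported result is equally likely to fall just above or just below the critical value.

```python
from math import comb


def caliper_test(z_values, critical=1.96, width=0.2):
    """Count |z| just above and just below `critical`, and test the split.

    Returns (over, under, p_value), where p_value is the one-sided exact
    binomial probability of seeing at least `over` results above the
    threshold if each side were equally likely (an assumption of the test).
    """
    over = sum(1 for z in z_values if critical <= abs(z) < critical + width)
    under = sum(1 for z in z_values if critical - width <= abs(z) < critical)
    n = over + under
    if n == 0:
        return over, under, 1.0  # nothing falls in the caliper window
    p_value = sum(comb(n, k) for k in range(over, n + 1)) / 2 ** n
    return over, under, p_value


# Hypothetical reported z-statistics from a set of published studies.
reported_z = [1.97, 2.0, 2.05, 2.1, 2.15, 1.99, 2.01, 2.12, 1.9, 1.8, 0.5, 3.0]
over, under, p_value = caliper_test(reported_z)
print(over, under, p_value)
```

A ratio of `over` to `under` well above one, with a small p‐value, suggests that results just short of significance are missing from the published record. The choice of window width is a judgment call: too narrow and there are too few results to compare, too wide and genuine effects start to drive the imbalance.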