Statistics: The Compass for Navigating a Data-Centric World
Marie Davidian
Department of Statistics, North Carolina State University
January 11, 2013

Statistics2013 Video
Available at http://statistics2013.org

Triumph of the geeks
Nate Silver predicted the outcome of the 2012 US presidential election in all 50 states using . . . Statistics
http://fivethirtyeight.blogs.nytimes.com/
Silver used a statistical model to combine the results of state-by-state polls, weighting them according to their previous accuracy, and to simulate many elections and estimate probabilities of the outcome
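To illustrate the idea, here is a minimal sketch of poll aggregation plus election simulation. The states, poll margins, weights, and uncertainty below are all made up for illustration; this is the general recipe, not Silver's actual model.

```python
# Minimal sketch of poll aggregation + election simulation
# (hypothetical numbers; an illustration of the idea, not Silver's model).
import random

random.seed(2012)

# Each state: list of (poll_margin, weight) pairs, where poll_margin is the
# candidate's lead and weight reflects the pollster's past accuracy.
polls = {
    "Ohio":           [(0.02, 1.0), (0.01, 0.5), (0.04, 0.8)],
    "Florida":        [(-0.01, 0.7), (0.01, 1.0)],
    "North Carolina": [(-0.03, 0.9), (-0.02, 0.6)],
}
electoral_votes = {"Ohio": 18, "Florida": 29, "North Carolina": 15}
POLL_SD = 0.03  # assumed uncertainty around the aggregated margin

def weighted_margin(state_polls):
    """Accuracy-weighted average of the poll margins for one state."""
    total_w = sum(w for _, w in state_polls)
    return sum(m * w for m, w in state_polls) / total_w

def simulate_election():
    """One simulated election: perturb each state's margin, tally EVs won."""
    ev = 0
    for state, state_polls in polls.items():
        margin = random.gauss(weighted_margin(state_polls), POLL_SD)
        if margin > 0:
            ev += electoral_votes[state]
    return ev

n_sims = 10_000
target = sum(electoral_votes.values()) / 2
wins = sum(simulate_election() > target for _ in range(n_sims))
print(f"Estimated probability of winning most of these EVs: {wins/n_sims:.2f}")
```

Repeating the simulation many times turns a set of noisy polls into an estimated probability of each outcome, which is exactly the kind of quantity reported at fivethirtyeight.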
Triumph of the geeks
Others did, too. . . “Dynamic Bayesian forecasting of presidential elections in the states,” by Drew A. Linzer, Journal of the American Statistical Association, in press

“Nate Silver-led statistics men crush pundits in election” – Bloomberg Businessweek
“Nate Silver has made statistics sexy again” – Associated Press
“Drew Linzer: The stats man who predicted Obama’s win” – BBC News Magazine
“The allure of the statistics field grows” – Boston Globe
But the interest in statistics didn’t start with the US elections. . .

Statistics in the news
New York Times, August 6, 2009: “I keep saying that the sexy job in the next 10 years will be statisticians” – Hal Varian, Chief Economist, Google
New York Times, January 26, 2012: “I went to parties and heard a little groan when people heard what I did. Now they’re all excited to meet me” – Rob Tibshirani, Department of Statistics, Stanford University
New York Times, February 11, 2012: “Statistics are interesting and fun. It’s cool now” – Andrew Gelman, Department of Statistics, Columbia University
The Wall Street Journal, December 28, 2012: Carl Bialik, The Numbers Guy

Data, data, and more data
Why is there so much talk of statistics and statisticians? Data
• Administrative (e.g., tax records), government surveys
• Genomic, meteorological, air quality, seismic, . . .
• Electronic medical records, health care databases
• Credit card transactions, point-of-sale, mobile phone
• Online search, social networks
• Polls, voter registration records
A veritable tsunami/deluge/avalanche of data

Demand
2011 McKinsey Global Institute report: Big data: The next frontier for innovation, competition, and productivity
“A significant constraint. . . will be a shortage of . . . people with deep expertise in statistics and data mining. . . a talent gap of 140K-190K positions in 2018 (in the US)”
http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation

Opportunities and challenges
• Our ability to collect, store, access, and manipulate vast and complex data is ever-improving
• The potential benefits to science and society of learning from these data are enormous
• However, Big Data does not automatically mean Big Information
• Science, decision-making, and policy formulation require not only prediction and finding associations and patterns, but uncovering causal relationships
• Which, as we’ll discuss later, is not so easy. . .

Perils
From “The Age of Big Data”: With huge data sets and fine-grained measurement, . . . there is increased risk of “false discoveries.” The trouble with seeking a meaningful needle in massive haystacks of data, says Trevor Hastie, a statistics professor at Stanford, is that “many bits of straw look like needles.”
Big Data also supplies more raw material for statistical shenanigans and biased fact-finding excursions. It offers a high-tech twist on an old trick: I know the facts, now let’s find ’em. That is, says Rebecca Goldin, a mathematician at George Mason University, “one of the most pernicious uses of data.”
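Hastie’s point is easy to demonstrate. A minimal sketch, not from the talk: screen thousands of pure-noise comparisons at the 5% level and spurious “discoveries” appear right on schedule.

```python
# Minimal sketch (not from the talk): with enough pure-noise comparisons,
# some will look "significant" by chance alone -- straw that looks like needles.
import random
import statistics

random.seed(1)
N_FEATURES = 5_000   # hypothetical number of candidate features screened
N_PER_GROUP = 20
CUTOFF = 2.0         # |t| > 2 is roughly the 5% level at these sample sizes

false_needles = 0
for _ in range(N_FEATURES):
    # Two groups drawn from the SAME distribution: no real effect exists.
    a = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    b = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    se = ((statistics.variance(a) + statistics.variance(b)) / N_PER_GROUP) ** 0.5
    t = (statistics.mean(a) - statistics.mean(b)) / se
    if abs(t) > CUTOFF:
        false_needles += 1

# Expect roughly 5% of 5,000, i.e., ~250 spurious "discoveries"
print(f"{false_needles} of {N_FEATURES} null features look 'significant'")
```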
Critical need
Sound, objective methods for modeling, analysis, and interpretation: Statistics
While Big Data have inspired considerable current interest in statistics, statistics has been fundamental in numerous areas of science, business, and government for decades

Roadmap
• A brief history
• Statistical stories
• Our data-rich future

What is statistics?
Statistics: The science of learning from data and of measuring, controlling, and communicating uncertainty
The path to what is now the formal discipline of statistical science is long and winding. . .

Origins – pre-1700
• Sporadic accounts of measurement and data collection and interpretation date back as early as the 5th century B.C.
• But it was not until the mid-1600s that the mathematical notions of probability began to be developed by (mainly) mathematicians and physicists (e.g., Blaise Pascal), often inspired by games of chance
• The first formal attempt to summarize and learn from data was by John Graunt, who created a precursor to modern life tables used in demography
• Christiaan Huygens was among the first to connect such data analysis to probability

Origins – 1700-1750
From 1700 to 1750, many key results in classical probability that underlie statistical theory were derived
• Jakob Bernoulli – law of large numbers, the Bernoulli and binomial probability distributions
• Abraham de Moivre – The Doctrine of Chances, precursor to the central limit theorem
• Daniel Bernoulli – expected utility, applications of probability to measurement problems in astronomy

Milestone events – 1750-1820
• Thomas Bayes’ 1763 An essay towards solving a problem in the Doctrine of Chances presented a special case of Bayes’ theorem (posthumously)
• Adrien-Marie Legendre described the method of least squares in 1805 (the criterion is recalled below)
[Figure: a scatter of points with a fitted least-squares line]
• Carl Friedrich Gauss connected least squares to Bayes’ theorem in 1809
• Pierre-Simon Laplace derived the central limit theorem and connected the normal probability distribution to least squares in 1810
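For concreteness (my notation; the original slide showed only the scatterplot and fitted line), Legendre’s least-squares criterion for a straight line, and its closed-form solution, are:

```latex
% Least squares for a line fitted to points (x_i, y_i), i = 1, ..., n:
% choose (a, b) to minimize the sum of squared residuals
\min_{a,\,b} \; \sum_{i=1}^{n} \bigl( y_i - a - b x_i \bigr)^2 ,
\qquad
\hat{b} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
               {\sum_{i=1}^{n} (x_i - \bar{x})^{2}} ,
\qquad
\hat{a} = \bar{y} - \hat{b}\,\bar{x} .
```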
More milestones – 1820-1900
• Adolphe Quetelet pioneered the statistical analysis of social science data – the “average man” (1835) and the normal distribution as a model for measurements (1842)
• The Royal Statistical Society (1834) and the American Statistical Association (1839) were founded
• Francis Galton introduced regression analysis (1885) and correlation (1888)
• Karl Pearson established the field of biometry, developed fundamental methods, and founded the first statistical journal, Biometrika (1901)

Modern statistics – 1900-1950s
The modern discipline of statistics was really established only in the twentieth century
• William Gosset (“Student”), a brewer for Guinness in Dublin, derived Student’s t distribution in 1908
• In the 1920s, Ronald Fisher developed many fundamental concepts, including the ideas of statistical models and randomization, the theory of experimental design, the method of analysis of variance, and tests of significance
• In the 1930s, Jerzy Neyman and Egon Pearson developed the theory of sampling, the competing approach of hypothesis testing, and the concept of confidence intervals
• Experimental design became a mainstay of agricultural research
• Fisher and Neyman-Pearson established the paradigm of frequentist statistical inference that is used today
• Also in the 1930s, Bayesian statistical inference was developed by Bruno de Finetti and others
• In the 1940s, many departments of statistics were established at universities in the US and Europe
• And fundamental theory of statistical inference was pursued by Wald, Cramér, Rao, and many others

Modern statistics to the present
From the 1950s on, there were numerous advances in theory, methods, and application
• The advent of medical statistics and epidemiological methods (Richard Doll, Austin Bradford Hill)
• The development of methods for analysis of censored time-to-event data (Paul Meier, D.R. Cox)
• The use of the theory of sampling to design surveys and the US census (Jerzy Neyman, Morris Hansen)
• The adoption of statistical quality control and experimental design in industry (W. Edwards Deming, George Box)
• Exploratory data analysis (John Tukey)
• And many, many more. . .

Modern statistics to the present
Computing fundamentally altered the field of statistics forever
• Complex calculations became feasible
• Much larger and more complicated data sets could be created and analyzed
• Sophisticated models and methods could be applied
• Statistical software implementing popular methods became widespread (e.g., SAS, developed at NC State in the 1960s/70s)
• Simulation to investigate performance of statistical methods became possible
• Bayesian statistical methods became feasible in complex settings (Markov chain Monte Carlo – MCMC); a minimal sketch of the idea follows below
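As an aside not on the original slides, the simplest MCMC algorithm, random-walk Metropolis, conveys the idea in a few lines: propose a small random move of the parameter and accept it with probability given by the ratio of posterior densities. The target density here is a stand-in chosen for illustration.

```python
# Minimal random-walk Metropolis sampler (illustrative sketch only).
import math
import random

random.seed(1)

def log_post(theta):
    """Log of an (unnormalized) example posterior: standard normal."""
    return -0.5 * theta ** 2

def metropolis(n_draws, step=1.0, theta=0.0):
    draws = []
    lp = log_post(theta)
    for _ in range(n_draws):
        proposal = theta + random.gauss(0, step)   # random-walk proposal
        lp_new = log_post(proposal)
        # Accept with probability min(1, posterior ratio)
        if math.log(random.random()) < lp_new - lp:
            theta, lp = proposal, lp_new
        draws.append(theta)
    return draws

draws = metropolis(20_000)
print(f"posterior mean ~ {sum(draws) / len(draws):.3f}")  # should be near 0
```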
Today
Statistical methods are used routinely in science, industry/business, and government
• Pharmaceutical companies employ statisticians, who work in all stages of drug development
• Statisticians are ubiquitous in medical and public health research, working with health sciences researchers to design studies, analyze data, and draw conclusions
• Google, Facebook, LinkedIn, credit card companies, and global retailers employ statisticians to develop and implement methods to mine their vast data
• Government science, regulatory, and statistical agencies employ statisticians to design surveys, make forecasts, develop estimates of income, review new drug applications, assess evidence of health effects of pollutants, . . .

Statistical stories
Some diverse examples where statistics and statisticians are essential. . .

The controlled clinical trial
The gold standard study for comparison of treatments (a question of cause and effect)
• An experiment designed to compare a new treatment to a control treatment
• Subjects are randomized to receive one treatment or the other ⇒ unbiased, fair comparison using statistical methods (hypothesis testing)
• In addition, blinding, placebo
• The first such clinical trial was conducted in the UK by the Medical Research Council in 1948, comparing streptomycin plus bed rest to bed rest alone in tuberculosis
• In 1954, 800K children in the US were randomized to the Salk polio vaccine or placebo to assess the vaccine’s effectiveness in preventing paralytic polio
• In 1969, evidence from a randomized clinical trial became mandatory for a new product to receive approval from the US Food and Drug Administration (FDA)
• Because a trial involves only a sample of patients from the entire population, the results are subject to uncertainty
• Statistical methods are critical for determining the sample size required to ensure that a real difference can be detected with a specified degree of confidence – a standard calculation is sketched below
• Which is why regulatory bodies like the FDA employ 100s of statisticians
• In the last 4 decades, statisticians have developed new methods to handle ethical and practical considerations
• E.g., group sequential trials that allow interim analyses at which the trial can be stopped early without compromising the ability to make a valid comparison
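A minimal sketch of one such calculation, using the standard normal-approximation formula for a two-arm trial comparing means. The effect size, standard deviation, level, and power below are hypothetical choices, not values from any trial discussed here.

```python
# Minimal sketch: per-arm sample size for a two-arm trial comparing means,
# via the textbook normal-approximation formula
#   n per arm = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * (sigma / delta)^2
import math
from statistics import NormalDist

def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Patients per arm needed to detect a true mean difference `delta`
    with two-sided level `alpha` and the given power; `sigma` = outcome SD."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2)

# E.g., to detect a 5-point difference when the outcome SD is 15:
print(n_per_arm(delta=5, sigma=15))  # about 142 patients per arm
```

Note how the required size grows with the square of sigma/delta: halving the detectable difference quadruples the trial.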
National forest inventory
Next stop, Bhutan
• The Kingdom of Bhutan, in South Asia, transitioned to a constitutional democracy in 2008
• The new constitution mandates that Bhutan maintain 60% forest cover in perpetuity
• A National Forest Inventory (NFI) was called for. . .
• My friend Tim Gregoire of Yale University, an expert in forest biometry, was consulted to help plan and implement Bhutan’s comprehensive NFI

An NFI is an assessment, based on statistical sampling and estimation, of the forest resources of a nation
• Set policy on forest resource management
• Monitor biodiversity, habitat type and extent, land conversion rates
• Measure quantity/quality of wood fiber for commodities
• Measure non-wood forest products
• Measure carbon storage and change
• Reference spatially where resources are located
Statistics is critical to developing the sampling plan for both remote sensing and field data and to estimating the abundance of resources based on 100s of measurements

Pharmacokinetics
What’s behind a drug label?
• A drug should be safe and effective
• Labeling provides guidance on dose and the conditions under which a drug should/should not be taken
• Partly behind this – pharmacokinetics (PK), the science of “what the body does to the drug”
• Key: Understanding Absorption, Distribution, Metabolism, Excretion in the population and how these processes vary across patients and are altered by conditions
• Statistical modeling is an integral part of the science
A hierarchical statistical model that allows these processes to vary across patients and conditions is fitted to drug concentration-time data; for a single oral dose, the concentration at time t is

Conc(t) = \frac{k_a \, \mathrm{Dose}}{V \, (k_a - Cl/V)} \left[ \exp\{-(Cl/V)\,t\} - \exp(-k_a t) \right]

where k_a = absorption rate, V = volume of distribution, Cl = clearance
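The model above is easy to evaluate in code. A small illustration (mine, with hypothetical parameter values); in practice a hierarchical (nonlinear mixed effects) model lets (k_a, V, Cl) vary across patients and conditions.

```python
# Minimal sketch: predicted drug concentration over time under the
# one-compartment model above (all parameter values are hypothetical).
import math

def conc(t, dose, ka, V, Cl):
    """Concentration at time t after a single oral dose:
    Conc(t) = ka*Dose / (V*(ka - Cl/V)) * (exp(-(Cl/V)*t) - exp(-ka*t))."""
    ke = Cl / V                          # elimination rate Cl/V
    scale = ka * dose / (V * (ka - ke))
    return scale * (math.exp(-ke * t) - math.exp(-ka * t))

# Hypothetical subject: 300 mg dose, ka = 1.2 /hr, V = 30 L, Cl = 5 L/hr
for t in range(0, 25, 4):
    print(f"t = {t:2d} hr   Conc = {conc(t, 300, 1.2, 30, 5):6.2f} mg/L")
```

The curve rises while absorption dominates, peaks, then decays at the elimination rate Cl/V, which is what the label’s dosing interval is designed around.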
Forensic science
An area where statisticians and better statistics are desperately needed!
• Fingerprints, DNA analysis, bite marks, firearm toolmarks, hair specimens, writing samples, toxicological analysis, . . .
• Laboratory- or expert interpretation-based
• 2009 US National Academy of Sciences report
• The report cites examples of insufficient recognition of sources of variability and their effects on uncertainties in many types of forensic science analyses. . .

“With the exception of nuclear DNA analysis, however, no forensic method has been rigorously shown to have the capacity to consistently, and with a high degree of certainty, demonstrate a connection between evidence and a specific individual or source.”
“A body of research is required to establish the limits and measures of performance and to address the impact of sources of variability and potential bias.”
“The development of quantifiable measures of uncertainty in the conclusions of forensic analyses . . . and of quantifiable measures of the reliability and accuracy of forensic analyses (are needed).”
Basically, the report recommends that current and new forensic practices should be developed and assessed using properly designed experiments and statistical methods!

The hazards of haphazard data
When data are simply observed and collected, without a principled design and randomization, be wary!
• Investigations of causal relationships can be compromised by confounding
• E.g., comparison of the effects of competing treatments
• When individual patients and their providers decide which treatment to take, there may be factors that are associated with both the choice of treatment and the outcome
• Failure to recognize/identify such confounding factors can lead to misleading conclusions

Simpson’s paradox
Data on 2 treatments from a healthcare database
[Figure: average outcome under Treatments A and B, shown separately for males and females and overall. Treatment A’s patients are 80% male/20% female; Treatment B’s are 20% male/80% female.]
Within each sex, one treatment shows the better average outcome; yet because the two treatment groups have such different sex compositions, the overall averages reverse the comparison. For example, if men tend to have better outcomes than women on either treatment, a treatment given mostly to men can look better overall even if it is worse within each sex.

Confounding and other threats
• Statistical methods are available to take confounding into appropriate account
• . . . but the confounding factors must be recorded in the database!
Other threats
• Missing information – why are some factors not recorded for some individuals?
• Drop out – sicker patients may disappear sooner in a longitudinal study
• Etc.
Comparative effectiveness research, which strives to recommend best uses for existing treatments through analyses of such databases, requires statistics!

Confronting our data-rich future
I hope I have convinced you that statistics and statisticians are essential to our data-rich future!
Big Data have enormous potential for generating new knowledge and improving human welfare. However, Big Data without statistics have enormous potential to mislead.
“The future demands that scientists, policy-makers, and the public be able to interpret increasingly complex information and recognize both the benefits and pitfalls of statistical analysis. Embedding statistics in science and society will pave the route to a data-informed future, and statisticians must lead this charge.” – Davidian and Louis, Science, April 6, 2012

2013 – the International Year of Statistics
A celebration of the contributions of statistics is long overdue!
http://statistics2013.org

References and further reading
Aldrich, J. Figures from the history of probability and statistics. http://www.economics.soton.ac.uk/staff/aldrich/Figures.htm
Davidian, M. and Louis, T. A. (2012). Why statistics? Science, 336, 12.
Fienberg, S. E. (1992). A brief history of statistics in three and one-half chapters: A review essay. Statistical Science, 7, 208–225.
Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press.