Statistics:
The Compass for Navigating
a Data-Centric World
Marie Davidian
Department of Statistics
North Carolina State University
January 11, 2013
Statistics2013 Video
Available at http://statistics2013.org
Triumph of the geeks
Nate Silver predicted the outcome of the 2012 US presidential
election in all 50 states using . . .
Statistics
http://fivethirtyeight.blogs.nytimes.com/
Silver used a statistical model to combine the results of
state-by-state polls, weighting them according to their previous
accuracy, and simulated many elections to estimate the
probabilities of the possible outcomes (a minimal sketch follows)
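To make the idea concrete, here is a minimal sketch of poll aggregation plus election simulation in Python. It is not Silver's actual model; the states, electoral votes, poll margins, pollster weights, and error assumptions are all hypothetical placeholders.

```python
# A minimal sketch of poll aggregation + election simulation, in the spirit
# of the approach described above. NOT Silver's actual model: the states,
# margins, weights, and polling error below are hypothetical placeholders.
import random

# Hypothetical state polls: (electoral votes, list of (poll_margin, weight)),
# where poll_margin is candidate A's lead in percentage points and weight
# reflects the pollster's past accuracy (higher = more trusted).
state_polls = {
    "State1": (29, [(2.0, 1.0), (4.0, 0.5)]),
    "State2": (18, [(-1.0, 1.0), (0.5, 0.8)]),
    "State3": (10, [(-3.0, 0.7)]),
}
POLL_SD = 3.0      # assumed polling error (percentage points)
N_SIM = 10_000     # number of simulated elections

def weighted_margin(polls):
    """Accuracy-weighted average of poll margins for one state."""
    total_w = sum(w for _, w in polls)
    return sum(m * w for m, w in polls) / total_w

total_ev = sum(ev for ev, _ in state_polls.values())
wins = 0
for _ in range(N_SIM):
    ev_a = 0
    for ev, polls in state_polls.values():
        # Perturb the weighted margin by random polling error, then
        # award the state's electoral votes to whoever leads.
        if random.gauss(weighted_margin(polls), POLL_SD) > 0:
            ev_a += ev
    if ev_a > total_ev / 2:
        wins += 1

print(f"Estimated P(candidate A wins): {wins / N_SIM:.2f}")
```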
Triumph of the geeks
Others did, too. . .
“Dynamic Bayesian forecasting of presidential elections in the
states,” by Drew A. Linzer, Journal of the American Statistical
Association, in press
Triumph of the geeks
“Nate Silver-led statistics men crush pundits in election”
– Bloomberg Businessweek
“Nate Silver has made statistics sexy again”
– Associated Press
“Drew Linzer: The stats man who predicted Obama’s win”
– BBC News Magazine
“The allure of the statistics field grows”
– Boston Globe
But the interest in statistics didn’t start with the
US elections. . .
Statistics in the news
New York Times, August 6, 2009
“I keep saying that the sexy job in the next 10 years will be
statisticians” – Hal Varian, Chief Economist, Google
Statistics in the news
New York Times, January 26, 2012
“I went to parties and heard a little groan when people heard
what I did. Now they’re all excited to meet me” – Rob
Tibshirani, Department of Statistics, Stanford University
Statistics in the news
New York Times, February 11, 2012
“Statistics are interesting and fun. It’s cool now” – Andrew
Gelman, Department of Statistics, Columbia University
Statistics in the news
The Wall Street Journal, December 28, 2012
Carl Bialik, The Numbers Guy
Data, data, and more data
Why is there so much talk of statistics and
statisticians?
Data
• Administrative (e.g., tax records), government surveys
• Genomic, meteorological, air quality, seismic, . . .
• Electronic medical records, health care databases
• Credit card transactions, point-of-sale, mobile phone
• Online search, social networks
• Polls, voter registration records
A veritable tsunami/deluge/avalanche of data
Demand
2011 McKinsey Global Institute report:
Big data: The next frontier for innovation,
competition, and productivity
“A significant constraint. . . will be a shortage of . . . people with
deep expertise in statistics and data mining. . . a talent gap of
140K - 190K positions in 2018 (in the US)”
http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation
Opportunities and challenges
• Our ability to collect, store, access, and manipulate vast
and complex data is ever-improving
• The potential benefits to science and society of learning
from these data are enormous
• However, Big Data does not automatically mean
Big Information
• Science, decision-making, and policy formulation require
not only prediction and finding associations and patterns,
but uncovering causal relationships
• Which, as we’ll discuss later, is not so easy. . .
Perils
From “The Age of Big Data”
With huge data sets and fine-grained measurement,. . . there is
increased risk of “false discoveries.” The trouble with seeking a
meaningful needle in massive haystacks of data, says Trevor
Hastie, a statistics professor at Stanford, is that “many bits of
straw look like needles.”
Big Data also supplies more raw material for statistical
shenanigans and biased fact-finding excursions. It offers a
high-tech twist on an old trick: I know the facts, now let’s find
’em. That is, says Rebecca Goldin, a mathematician at George
Mason University, “one of the most pernicious uses of data.”
Critical need
Sound, objective methods for modeling,
analysis, and interpretation
Statistics
While Big Data have inspired considerable current interest in
statistics, statistics has been fundamental in numerous areas of
science, business, and government for decades
Roadmap
• A brief history
• Statistical stories
• Our data-rich future
What is statistics?
Statistics: The science of learning from data
and of measuring, controlling, and
communicating uncertainty
The path to what is now the formal discipline of statistical
science is long and winding. . .
Origins – pre-1700
• Sporadic accounts of measurement and data collection
and interpretation date back as early as the 5th century B.C.
• But it was not until the mid-1600s that the
mathematical notions of probability began to be developed
by (mainly) mathematicians and physicists (e.g., Blaise
Pascal), often inspired by games of chance
• The first formal attempt to summarize and learn from data
was by John Graunt, who created a precursor to modern
life tables used in demography
• Christiaan Huygens was among the first to connect such
data analysis to probability
Origins – 1700-1750
• From 1700 to 1750, many key results in classical
probability that underlie statistical theory were derived
• Jakob Bernoulli – law of large numbers, the Bernoulli and
binomial probability distributions
• Abraham de Moivre – The Doctrine of Chances, precursor
to the central limit theorem
• Daniel Bernoulli – expected utility, applications of
probability to measurement problems in astronomy
Milestone events – 1750-1820
• Thomas Bayes’ 1763 An essay towards solving a problem
in the Doctrine of Chances presented a special case of
Bayes’ theorem (posthumously)
• Adrien-Marie Legendre described the method of least
squares in 1805
Milestone events – 1750-1820
• Carl Friedrich Gauss connected least squares to Bayes’
theorem in 1809
• Pierre-Simon Laplace derived the central limit theorem and
connected the normal probability distribution to least
squares in 1810
More milestones – 1820-1900
• Adolphe Quetelet pioneered the statistical analysis of
social science data – the “average man” (1835) and the
normal distribution as a model for measurements (1842)
• The Royal Statistical Society (1834) and American
Statistical Association (1839) were founded
• Francis Galton introduced regression analysis (1885) and
correlation (1888)
• Karl Pearson established the field of biometry and
developed fundamental methods, and founded the first
statistical journal, Biometrika (1901)
Modern statistics – 1900-1950s
The modern discipline of statistics was really
established only in the twentieth century
• William Gosset (“Student”), a brewer for Guinness in
Dublin, derived the Student’s t distribution in 1908
• In the 1920s, Ronald Fisher developed many fundamental
concepts, including the ideas of statistical models and
randomization, theory of experimental design, the method
of analysis of variance, and tests of significance
• In the 1930s, Jerzy Neyman and Egon Pearson developed
the theory of sampling, the competing approach of
hypothesis testing, and the concept of confidence intervals
• Experimental design became a mainstay of agricultural
research
Modern statistics – 1900-1950s
• Fisher/Neyman-Pearson established the paradigm of
frequentist statistical inference that is used today
• Also in the 1930s, Bayesian statistical inference was
developed by Bruno de Finetti and others
• In the 1940s, many departments of statistics were
established at universities in the US and Europe
• And fundamental theory of statistical inference was
pursued by Wald, Cramér, Rao and many others
Modern statistics to the present
From the 1950s on, there were numerous
advances in theory, methods, and application
• The advent of medical statistics and epidemiological
methods (Richard Doll, Austin Bradford Hill)
• The development of methods for analysis of censored
time-to-event data (Paul Meier, D.R. Cox)
• The use of the theory of sampling to design surveys and
the US census (Jerzy Neyman, Morris Hansen)
• The adoption of statistical quality control and experimental
design in industry (W. Edwards Deming, George Box)
• Exploratory data analysis (John Tukey)
• And many, many more. . .
Modern statistics to the present
Computing fundamentally altered the field of
statistics forever
• Complex calculations became feasible
• Much larger and more complicated data sets could be
created and analyzed
• Sophisticated models and methods could be applied
• Statistical software implementing popular methods became
widespread (e.g., SAS, developed at NC State in the
1960s/70s)
• Simulation to investigate performance of statistical
methods became possible
• Bayesian statistical methods became feasible in complex
settings via Markov chain Monte Carlo (MCMC); a minimal
sampler is sketched below
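For a flavor of MCMC, here is a minimal random-walk Metropolis sampler for a toy posterior (normal likelihood, flat prior). The data and proposal scale are hypothetical; real applications involve far richer models.

```python
# A minimal random-walk Metropolis sampler, the simplest instance of MCMC.
# The target is a toy posterior (normal likelihood, flat prior) chosen only
# for illustration; the data and tuning constants are hypothetical.
import math
import random

data = [4.9, 5.3, 4.7, 5.1, 5.6]   # hypothetical observations
SIGMA = 1.0                         # assumed known measurement SD

def log_post(mu):
    """Log posterior of mu, up to an additive constant (flat prior)."""
    return -sum((x - mu) ** 2 for x in data) / (2 * SIGMA ** 2)

mu, draws = 0.0, []
for _ in range(20_000):
    proposal = mu + random.gauss(0, 0.5)          # random-walk proposal
    # Accept with probability min(1, posterior ratio)
    if math.log(random.random()) < log_post(proposal) - log_post(mu):
        mu = proposal
    draws.append(mu)

kept = draws[5_000:]                               # discard burn-in
print(f"Posterior mean of mu ~ {sum(kept) / len(kept):.2f}")
```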
Today
Statistical methods are used routinely in
science, industry/business, and government
• Pharmaceutical companies employ statisticians, who work
in all stages of drug development
• Statisticians are ubiquitous in medical and public health
research, working with health sciences researchers to
design studies, analyze data, and draw conclusions
• Google, Facebook, LinkedIn, credit card companies, global
retailers employ statisticians to develop and implement
methods to mine their vast data
• Government science, regulatory, and statistical agencies
employ statisticians to design surveys, make forecasts,
develop estimates of income, review new drug applications,
assess evidence of health effects of pollutants, . . .
Statistical stories
Some diverse examples where statistics and
statisticians are essential. . .
The controlled clinical trial
The gold standard study for comparison of
treatments (a question of cause and effect)
• An experiment designed to compare a new treatment to a
control treatment
• Subjects are randomized to receive one treatment or the
other ⇒ unbiased, fair comparison using statistical
methods (hypothesis testing; a permutation-test sketch
follows this list)
• In addition, blinding and placebos guard against bias
• The first such clinical trial was conducted in the UK by the
Medical Research Council in 1948, comparing
streptomycin+bed rest to bed rest alone in tuberculosis
• In 1954, 800K children in the US were randomized to the
Salk polio vaccine or placebo to assess the vaccine’s
effectiveness in preventing paralytic polio
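Randomization is what justifies the comparison: if the treatment truly has no effect, the treatment labels are arbitrary, so re-randomizing them shows how large a difference chance alone can produce. A minimal permutation-test sketch, with hypothetical outcome data:

```python
# A minimal permutation test exploiting the trial's randomization: shuffle
# the treatment labels many times to see how extreme the observed difference
# is under "no effect". The outcome data below are hypothetical.
import random

treated = [7.1, 6.8, 7.9, 8.2, 7.5]   # hypothetical outcomes, new treatment
control = [6.5, 7.0, 6.2, 6.9, 6.4]   # hypothetical outcomes, control

def mean_diff(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

observed = mean_diff(treated, control)
pooled = treated + control
N, count = 10_000, 0
for _ in range(N):
    random.shuffle(pooled)             # re-randomize the treatment labels
    if mean_diff(pooled[:5], pooled[5:]) >= observed:
        count += 1

print(f"One-sided permutation p-value ~ {count / N:.3f}")
```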
The controlled clinical trial
• In 1969, evidence from a randomized clinical trial became
mandatory for a new product to receive approval from the
US Food and Drug Administration (FDA)
• Because a trial involves only a sample of patients from the
entire population, the results are subject to uncertainty
• Statistical methods are critical for determining the sample
size required to ensure that a real difference can be
detected with a specified degree of confidence (a sample
size sketch follows this list)
• Which is why regulatory bodies like the FDA employ 100s
of statisticians
• In the last 4 decades, statisticians have developed new
methods to handle ethical and practical considerations
• E.g., group sequential trials that allow interim analyses at
which the trial can be stopped early without compromising
the ability to make a valid comparison
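A minimal sketch of the textbook sample-size calculation for comparing two response rates; the rates, significance level, and power below are hypothetical choices, and real trial planning involves much more.

```python
# A minimal sketch of the standard sample-size formula for comparing two
# proportions. The response rates, alpha, and power are hypothetical.
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Subjects per arm to detect p1 vs p2 with a two-sided level-alpha test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value
    z_b = NormalDist().inv_cdf(power)           # power quantile
    var = p1 * (1 - p1) + p2 * (1 - p2)         # sum of binomial variances
    return ((z_a + z_b) ** 2 * var) / (p1 - p2) ** 2

# E.g., to detect an improvement from a 30% to a 40% response rate:
print(f"{n_per_arm(0.30, 0.40):.0f} subjects per arm")   # ~353
```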
National forest inventory
Next stop, Bhutan
• The Kingdom of Bhutan, in South Asia, transitioned to a
constitutional democracy in 2008
• The new constitution mandates that Bhutan maintain 60%
forest cover in perpetuity
• A National Forest Inventory was called for. . .
• My friend Tim Gregoire of Yale University, an expert in
forest biometry, was consulted to help plan and implement
Bhutan’s comprehensive NFI
National forest inventory
An NFI is an assessment based on statistical
sampling and estimation of the forest resources
of a nation
• Set policy on forest resource management
• Monitor biodiversity, habitat type and extent, land
conversion rates
• Measure quantity/quality of wood fiber for commodities
• Measure non-wood forest products
• Measure carbon storage and change
• Reference spatially where resources are located
Statistics is critical to developing the sampling plan for both
remote sensing and field data and to estimation of abundance
of resources based on 100s of measurements
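A minimal sketch of the plot-based "expansion" estimation underlying an NFI: measure a resource on randomly located sample plots, scale up to the forest area, and attach a standard error. The plot size, forest area, and measurements below are hypothetical.

```python
# A minimal sketch of plot-based forest estimation: expand per-hectare
# measurements from random sample plots to a total for the whole forest.
# Plot size, forest area, and volumes are hypothetical placeholders.
from math import sqrt
from statistics import mean, stdev

PLOT_HA = 0.05          # area of each sample plot, in hectares
FOREST_HA = 2_500_000   # total forest area, in hectares

# Measured timber volume (m^3) on each sampled plot
plot_volumes = [1.8, 2.4, 0.0, 3.1, 1.2, 2.7, 0.9, 2.0, 1.5, 2.2]

n = len(plot_volumes)
per_ha = [v / PLOT_HA for v in plot_volumes]     # volume per hectare
total_est = mean(per_ha) * FOREST_HA             # expansion estimator
se_total = stdev(per_ha) / sqrt(n) * FOREST_HA   # its standard error

print(f"Estimated total volume: {total_est:,.0f} m^3")
print(f"Approx. 95% CI half-width: {1.96 * se_total:,.0f} m^3")
```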
Pharmacokinetics
What’s behind a drug label?
• A drug should be safe and effective
• Labeling provides guidance on dose, conditions under
which a drug should/should not be taken
• Partly behind this – pharmacokinetics (PK), the science of
“what the body does to the drug”
• Key: Understanding Absorption, Distribution, Metabolism,
Excretion in the population and how these processes vary
across patients and are altered by conditions
• Statistical modeling is an integral part of the science
Pharmacokinetics
A hierarchical statistical model that allows these processes to
vary across patients and conditions is fitted to drug
concentration-time data
Pharmacokinetics
$$\mathrm{Conc}(t) = \frac{k_a\,\mathrm{Dose}}{V\,(k_a - Cl/V)}\,\bigl[\exp\{-(Cl/V)\,t\} - \exp(-k_a t)\bigr]$$
ka = absorption rate, V = volume of distribution, Cl = clearance
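A minimal sketch of fitting this model to a single subject's concentration-time data by nonlinear least squares (the full approach is hierarchical, pooling across many subjects). The dose, measurements, and starting values are hypothetical.

```python
# A minimal sketch: fit the one-compartment PK model above to one subject's
# concentration-time data with nonlinear least squares. Dose, data, and
# starting values are hypothetical; the real analysis is hierarchical.
import numpy as np
from scipy.optimize import curve_fit

DOSE = 100.0  # administered dose (hypothetical units)

def conc(t, ka, V, Cl):
    """One-compartment model with first-order absorption."""
    ke = Cl / V  # elimination rate
    return (ka * DOSE) / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# Hypothetical concentration-time measurements for one subject
t_obs = np.array([0.5, 1, 2, 4, 6, 8, 12, 24])
c_obs = np.array([1.9, 3.1, 3.9, 3.5, 2.8, 2.2, 1.3, 0.3])

# Estimate ka, V, Cl from rough starting values
(ka, V, Cl), _ = curve_fit(conc, t_obs, c_obs, p0=[1.0, 20.0, 2.0])
print(f"ka = {ka:.2f}/h, V = {V:.1f} L, Cl = {Cl:.2f} L/h")
```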
Forensic science
An area where statisticians and better statistics
are desperately needed!
• Fingerprints, DNA analysis, bite marks, firearm toolmarks,
hair specimens, writing samples, toxicological analysis,. . .
• Laboratory- or expert interpretation-based
• 2009 US National Academy of Sciences report
• The report cites examples of lack of sufficient recognition
of sources of variability and their effects on uncertainties in
many types of forensic science analyses. . .
Forensic science
“With the exception of nuclear DNA analysis, however, no forensic
method has been rigorously shown to have the capacity to
consistently, and with a high degree of certainty, demonstrate a
connection between evidence and a specific individual or source.”
“A body of research is required to establish the limits and measures
of performance and to address the impact of sources of variability
and potential bias.”
“The development of quantifiable measures of uncertainty in the
conclusions of forensic analyses . . . and of quantifiable measures of
the reliability and accuracy of forensic analyses (are needed).”
Basically, the report recommends that current and new forensic
practices should be developed and assessed using properly
designed experiments and statistical methods!
The hazards of haphazard data
When data are simply observed and collected,
without a principled design and randomization,
be wary!
• Investigations of causal relationships can be compromised
by confounding
• E.g., comparison of the effects of competing treatments
• When individual patients and their providers decide which
treatment to take, there may be factors that are associated
with both the choice of treatment and outcome
• Failure to recognize/identify such confounding factors can
lead to misleading conclusions
Simpson’s paradox
Data on 2 treatments from a healthcare database
[Figure: average outcome for Trt A and Trt B, shown overall
(Avg Trt A, Avg Trt B) and separately for males and females;
the within-sex comparisons point in the opposite direction
from the overall comparison. Group composition:
A: 80%/20% M/F, B: 20%/80% M/F]
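A numeric illustration with hypothetical counts chosen to produce the reversal: treatment B looks better within each sex, yet A looks better overall, purely because of how the sexes are distributed across the two treatment groups.

```python
# Simpson's paradox with hypothetical counts mirroring the figure:
# A's patients are 80% male, B's are 20% male. B wins within each sex,
# yet A wins overall. (success, total) by treatment and sex:
data = {
    ("A", "Male"):   (64, 80),
    ("A", "Female"): (10, 20),
    ("B", "Male"):   (18, 20),
    ("B", "Female"): (48, 80),
}

for sex in ("Male", "Female"):
    ra = data[("A", sex)][0] / data[("A", sex)][1]
    rb = data[("B", sex)][0] / data[("B", sex)][1]
    print(f"{sex}: A = {ra:.0%}, B = {rb:.0%}")   # B wins within each sex

for trt in ("A", "B"):
    s = sum(data[(trt, sex)][0] for sex in ("Male", "Female"))
    n = sum(data[(trt, sex)][1] for sex in ("Male", "Female"))
    print(f"Overall {trt}: {s}/{n} = {s/n:.0%}")  # yet A wins overall
```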
Confounding and other threats
• Statistical methods are available to take confounding into
appropriate account
• . . . but the confounding factors must be recorded in the
database!
Other threats
• Missing information – why are some factors not recorded
for some individuals?
• Drop out – sicker patients may disappear sooner in a
longitudinal study
• Etc
Comparative effectiveness research, which strives
to recommend best uses of existing treatments through
analyses of such databases, requires statistics!
Confronting our data-rich future
I hope I have convinced you that statistics and statisticians are
essential to our data-rich future!
Big Data have enormous potential for generating new
knowledge and improving human welfare. However, Big Data
without statistics have enormous potential to mislead.
“The future demands that scientists, policy-makers, and the
public be able to interpret increasingly complex information and
recognize both the benefits and pitfalls of statistical analysis.
Embedding statistics in science and society will pave the route
to a data-informed future, and statisticians must lead this
charge.”
– Davidian and Louis, Science, April 6, 2012
2013 – the International Year of Statistics
A celebration of the contributions of statistics is
long overdue!
http://statistics2013.org
References and further reading
Aldrich, J. Figures from the history of probability and statistics.
http://www.economics.soton.ac.uk/staff/aldrich/Figures.htm
Davidian, M. and Louis, T.A. (2012). Why statistics? Science, 336, 12.
Fienberg, S.E. (1992). A brief history of statistics in three and one-half
chapters: A review essay. Statistical Science, 7, 208–225.
Stigler, S.M. (1986). The History of Statistics: The Measurement of
Uncertainty Before 1900. Harvard University Press.