Lecture 1: Measurements, Statistics, Probability, and Data Display
Karen Bandeen-Roche, PhD
Department of Biostatistics, Johns Hopkins University
July 11, 2011
Introduction to Statistical Measurement and Modeling

What is statistics?
The study of ...
(i.) ... populations
(ii.) ... variation
(iii.) ... methods of the reduction of data.
"The original meaning of the word ... suggests that it was the study of populations of human beings living in political union." Sir R. A. Fisher

What is statistics?
"... Statistical Science [is] the particular aspect of human progress which gives the 20th century its special character.... It is to the statistician that the present age turns for what is most essential in all its more important activities." Sir R. A. Fisher

What is statistics? Less complimentary views
"Science is difficult. You need mathematics and statistics, which is dull like learning a language." Richard Gregory
"There are three kinds of lies: lies, damned lies and statistics." Mark Twain, quoting Disraeli

What is statistics?
"Statistics is concerned with METHODS for COLLECTING & DESCRIBING DATA and then for ASSESSING STRENGTH OF EVIDENCE in DATA FOR/AGAINST SCIENTIFIC IDEAS!" Scott L. Zeger

What is statistics?
"... the art and science of gathering, analyzing, and making inferences from data." Encyclopaedia Britannica
[Slide graphic: Poetry, Music, Statistics, Mathematics, Physics]

What is biostatistics?
The science of learning from biomedical data involving appreciable variability or uncertainty.
An amalgam.

Data examples: Osteoporosis screening
- Importance: Osteoporosis afflicts millions of older adults (particularly women) worldwide; it lowers quality of life, heightens the risk of falls, etc.
- Scientific question: Can we detect osteoporosis earlier and more safely?
- Method: ultrasound versus dual photon absorptiometry (DPA), tried out on 42 older women
- Implications: Treatment to slow / prevent onset

Osteoporosis data
[Figure: boxplots of DPA scores and ultrasound scores by osteoporosis group (control vs. case)]

Data examples: Temperature modeling
- Importance: Climate change is suspected. Heat waves, increased particle pollution, etc. may harm health.
- Scientific question: Can we accurately and precisely model geographic variation in temperature?
- Method: Maximum January-average temperature over 30 years in 62 United States cities
- Implications: Valid temperature models can support future policy planning

United States temperature map
http://green-enb150.blogspot.com/2011/01/isorhythmic-map-united-states-weather.html

Modeling geographical variation: latitude and longitude
http://www.enchantedlearning.com/usa/activity/latlong/

Temperature data
[Figure: scatterplot matrix of temp, longitude, and latitude]

Data examples: Boxing and neurological injury
- Importance: (1) Boxing and other sources of brain jarring may cause neurological harm. (2) In ~1986 the IOC considered replacing Olympic boxing with golf.
- Scientific question: Does amateur boxing lead to decline in neurological performance?
- Method: "Longitudinal" study of 593 amateur boxers
- Implications: Prevention of brain injury from subconcussive blows

Boxing data
[Figure: scatterplot of blkdiff versus blbouts with a lowess smoother, bandwidth = .8]
Course objectives
- Demonstrate familiarity with statistical tools for characterizing population measurement properties
- Distinguish procedures for deriving estimates from data and making associated scientific inferences
- Describe "association" and its importance in scientific discovery
- Understand, apply, and interpret findings from methods of
  data display
  standard statistical regression models
  standard statistical measurement models
- Appreciate the roles of statistics in health science

Basic paradigm of statistics
- We wish to learn about populations
  All about which we wish to make an inference
  "True" experimental outcomes and their mechanisms
- We do this by studying samples
  A subset of a given population
  "Represents" the population
  Sample features are used to infer population features
- The method of obtaining the sample is important
  Simple random sample: all population elements / outcomes have equal probability of inclusion

Basic paradigm of statistics
[Schematic: probability carries "truth for the population" to the "observed value for a representative sample"; statistical inference runs in the reverse direction.]
Tools for description:
  Populations: probability, parameters, values / distributions, hypotheses, models
  Samples: probability, statistics / estimates, data displays, statistical tests, analyses

Probability
- A way of characterizing random experiments: experiments whose outcome is not determined beforehand
- Sample space: Ω := {all possible outcomes}
- Event: A ⊆ Ω := a collection of some outcomes
- Probability = "measure" on Ω
  Our course: measure of relative frequency of occurrence
  "Bayesian": measure of relative belief in occurrence

Probability measures
Satisfy the following axioms:
i) P{Ω} = 1 (reads "probability of Ω")
ii) 0 ≤ P{A} ≤ 1 for each A; 0 = "can't happen", 1 = "must happen"
iii) Given disjoint events {A_k}, P{∪_k A_k} = Σ_k P{A_k}
   "Disjoint" = "mutually exclusive": no two can happen at the same time

Random variable (RV)
- A function which assigns numbers to outcomes of a random experiment: X: Ω → ℝ
- Measurements
- Support := S_X = range of the RV X
- Two fundamental types of measurements
  Discrete: S_X is countable ("gaps" in possible values); Binary: two possible outcomes
  Continuous: S_X is an interval in ℝ ("no gaps" in values)

Random variable (RV)
- Example 1: X = number of heads in two fair coin tosses; S_X = {0, 1, 2}
- Example 2: Draw one of your names out of a hat; X = age (in years) of the person whose name I draw. S_X = ? Mass function = ?

Probability distributions
- Heuristic: summarizes the possible values of a random variable and the probabilities with which each occurs
- Discrete X: probability mass function = a list exactly as in the heuristic: p: x → P(X = x)
- Example = 2 fair coin tosses: P{HH} = P{HT} = P{TH} = P{TT} = ¼
  Mass function:
    x             0    1    2
    p(x) = P(X=x) ¼    ½    ¼
  p(y) = 0 for y not in {0, 1, 2}

Cumulative probability distributions
- F: x → P(X ≤ x) = cumulative distribution function (CDF)
- Example = 2 fair coin tosses:
    x      (-∞,0)   0    (0,1)   1    (1,2)   2    (2,∞)
    F(x)     0     1/4    1/4   3/4    3/4    1      1
    p(x)     0     1/4     0    1/2     0    1/4     0
  (draw a picture of p and F)

Cumulative probability distributions
- Example = 2 fair coin tosses: notice that p(x) is recovered as differences in values of F(x)
- Suppose x_1 ≤ x_2 ≤ ... and S_X = {x_1, x_2, ...}; then p(x_i) = F(x_i) - F(x_{i-1}) for each i (define x_0 = -∞ and F(x_0) = 0)
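For illustration, a minimal Python sketch (assuming numpy is available) that tabulates the CDF of the two-coin-toss example from its mass function and then recovers p(x_i) as differences F(x_i) - F(x_{i-1}):

```python
import numpy as np

# Two fair coin tosses: X = number of heads
support = np.array([0, 1, 2])
p = np.array([0.25, 0.5, 0.25])        # mass function p(x) = P(X = x)

# CDF at the support points: F(x_i) = P(X <= x_i) = cumulative sum of p
F = np.cumsum(p)                       # [0.25, 0.75, 1.0]

# Recover the mass function as differences of the CDF,
# using the convention x_0 = -infinity, F(x_0) = 0
p_recovered = np.diff(F, prepend=0.0)  # [0.25, 0.5, 0.25]

print(dict(zip(support, p_recovered)))
```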
Cumulative probability distributions
- Draw one of your names out of a hat; X = age (in years) of the person whose name I draw
- What about continuous RVs?
- Can we list the possible values of a random variable and the probabilities with which each occurs? NO. If S_X is uncountable, we can't list the values!
- The CDF is the fundamental distributional quantity: F(x) = P{X ≤ x}, with F(x) satisfying
  i) a ≤ b ⇒ F(a) ≤ F(b)
  ii) lim_{b→∞} F(b) = 1
  iii) lim_{b→-∞} F(b) = 0
  iv) lim_{b_n ↓ b} F(b_n) = F(b) (right-continuity)
  v) P{a < X ≤ b} = F(b) - F(a)

Two continuous CDFs
[Figure: "Normal" and "Exponential" CDFs for the ultrasound scores, P(US ≤ score) versus ultrasound score]

Mass function analog: density
- Defined when F is differentiable everywhere ("absolutely continuous")
- The density f(x) is defined as
  lim_{ε↓0} P{X ∈ [x-ε/2, x+ε/2]}/ε = lim_{ε↓0} [F(x+ε/2) - F(x-ε/2)]/ε = d/dy F(y) |_{y=x}
- Properties
  i) f ≥ 0
  ii) P{a ≤ X ≤ b} = ∫_a^b f(x) dx
  iii) P{X ∈ A} = ∫_A f(x) dx
  iv) ∫_{-∞}^{∞} f(x) dx = 1

Two densities
[Figure: "Normal" and "Exponential" densities for the ultrasound scores]

Probability model parameters
- Fundamental distributional quantities:
  Location: 'central' value(s)
  Spread: variability
  Shape: symmetric versus skewed, etc.
[Figure: densities with different locations and with different spreads]

Probability model parameters - Location
- Mean: E[X] = ∫ x dF(x) = µ
  Discrete RV: E[X] = Σ_{x ∈ S_X} x p(x)
  Continuous case: E[X] = ∫ x f(x) dx
- Linearity property: E[a + bX] = a + b E[X]
- Physical interpretation: center of mass

Probability model parameters - Location
- Median
  Heuristic: the value such that ½ of the probability weight lies above and ½ below
  Definition: the median is m such that F(m) ≥ ½ and P{X ≥ m} ≥ ½
- Quantile (more generally)
  Definition: Q(p) = q such that F_X(q) ≥ p and P{X ≥ q} ≥ 1 - p
  Median = Q(1/2)

Probability model parameters - Spread
- Variance: Var[X] = ∫ (x - E[X])² dF(x) = σ²
  Shortcut formula: Var[X] = E[X²] - (E[X])²
  Var[a + bX] = b² Var[X]
  Physical interpretation: moment of inertia
- Standard deviation: SD[X] = σ = √Var[X]
- Interquartile range (IQR) = Q(.75) - Q(.25)

Pause / Recapitulation
- We learn about populations through representative samples
- Probability provides a way to characterize populations
  Possibly unseen (models, hypotheses)
  Random experiment mechanisms
- We will now turn to the characterization of samples
  Formal: probability
  Informal: exploratory data analysis (EDA)

Describing samples - Empirical CDF
- Given data X_1, ..., X_n: F_n(x) = #{X_i's ≤ x} / n
- Define the indicator 1{A} := 1 if A is true; = 0 if A is false
- ECDF = F_n(x) = (1/n) Σ_i 1{X_i ≤ x} = the probability (proportion) of values ≤ x in the sample
- Notice F_n is a genuine CDF with the correct properties
- Mass function: p(x) = 1/n if x ∈ {X_1, ..., X_n}; = 0 otherwise

Sample statistics
- Statistic = a function of the data
- As defined in the probability section, with F = F_n
- Mean = X̄_n = ∫ x dF_n(x) = (1/n) Σ_{i=1}^n X_i
- Variance = s² = (1/(n-1)) Σ_{i=1}^n (X_i - X̄_n)²
- Standard deviation = s = √(s²)

Sample statistics - Percentiles
- "Order statistics" (sorted values):
  X_(1) = min(X_1, ..., X_n)
  X_(n) = max(X_1, ..., X_n)
  X_(j) = jth smallest value, etc.
- Median = m_n = {x: F_n(x) ≥ ½} and {x: P_{F_n}{X ≥ x} ≥ ½}
  = X_((n+1)/2) = the middle value if n is odd
  = [X_(n/2) + X_(n/2+1)]/2 = the mean of the middle two values if n is even
- Quantile Q_n(p) = {x: F_n(x) ≥ p} and {x: P_{F_n}{X ≥ x} ≥ 1 - p}
- Outlier = a data value "far" from the bulk of the data
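As a concrete illustration of these sample statistics, here is a minimal Python sketch (assuming numpy) using the first five control ultrasound scores from the osteoporosis data. Note that numpy's quantile function interpolates between order statistics, whereas the slides' definition selects an order statistic directly.

```python
import numpy as np

# First five control ultrasound scores from the osteoporosis data, for illustration
x = np.array([1606.0, 1650.25, 1659.75, 1662.0, 1760.75])

# Empirical CDF: F_n(t) = #{X_i <= t} / n
def ecdf(data, t):
    data = np.asarray(data)
    return np.mean(data <= t)

xbar = x.mean()                        # sample mean, (1/n) * sum of X_i
s2 = x.var(ddof=1)                     # sample variance with the n-1 denominator
s = np.sqrt(s2)                        # sample standard deviation
median = np.median(x)                  # middle value (mean of middle two if n is even)
q1, q3 = np.quantile(x, [0.25, 0.75])  # quartiles (numpy interpolates)
iqr = q3 - q1                          # interquartile range

print(ecdf(x, 1660.0), xbar, s2, s, median, iqr)
```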
Describing samples - Plots
- Stem and leaf plot: an easy "density" display
- Steps
  Split each value into leading digits and trailing digits
  Stems: write down all possible leading digits in order, including those that "might have occurred"
  Leaves: for each data value, write down the first trailing digit next to the appropriate stem (one leaf per datum)
- Issue: the number of stems
  Chiefly science
  Rules of thumb: √n, 1 + 3.2 log₁₀(n)

Describing samples - Plots
- Boxplot
  Draw a box whose "ends" are Q(1/4) and Q(3/4)
  Draw a line through the box at the median
  Boxplot criterion for an "outlier": beyond the "inner fences" = hinges ± 1.5 × IQR
  Draw lines ("whiskers") from the ends of the box to the last points inside the inner fences
  Show all outliers individually
- Note: perhaps its greatest use is with multiple batches

Osteoporosis data
Osteo  Age  US Score  DPA
0      58   1606      0.837
0      68   1650.25   0.841
0      53   1659.75   0.917
0      68   1662      0.975
0      54   1760.75   0.722
0      56   1770.25   0.801
0      77   1773.5    1.213
0      54   1789      1.027
0      62   1808.25   1.045
0      59   1812.5    0.988
0      72   1822.38   0.907
0      53   1826      0.971
0      61   1828      0.88
0      51   1868.5    0.898
0      61   1898.25   0.806
0      52   1908.88   0.994
0      66   1911.75   1.045
0      53   1935.75   0.869
0      62   1937.75   0.968
0      59   1946      0.957
0      50   2004.5    0.954
0      61   2043.08   1.072

Osteoporosis data
Osteo  Age  US Score  DPA
1      73   1588.66   0.785
1      63   1596.83   0.839
1      61   1608.16   0.786
1      75   1610.5    0.825
1      25   1617.75   0.916
1      64   1626.5    0.839
1      69   1658.33   1.191
1      62   1663.88   0.648
1      68   1674.8    0.906
1      58   1690.5    0.688
1      57   1695.15   0.834
1      62   1703.88   0.6
1      64   1704      0.762
1      66   1704.8    0.977
1      58   1715.75   0.704
1      70   1716.33   0.916
1      62   1739.41   0.86
1      67   1756.75   0.776
1      70   1800.75   0.799
1      42   1884.13   0.879

Introduction: Statistical modeling
- Statistical models: systematic + random
- Probability modeling involves the random part
- Often a few parameters "Θ" are left to be estimated from the data
- Scientific questions are expressed in terms of Θ
- The model is a tool / lens / function for investigating scientific questions
  "Right" versus "wrong" is misguided; better: "effective" versus "not effective"

Modeling: Parametric distributions
- Exponential distribution
  F(x) = 1 - e^{-λx} if x ≥ 0; = 0 otherwise
  Model parameter: λ = rate
  E[X] = 1/λ; Var[X] = 1/λ²
- Uses
  Time-to-event data
  "Memoryless"

Modeling: Parametric distributions
- Normal distribution
  f(x) = (1/(σ√(2π))) exp{-(x - μ)²/(2σ²)} on support S_X = (-∞, ∞)
  The distribution function has no closed form: F(x) := ∫_{-∞}^x f(t) dt, with f as given above
  F(x) is tabulated and available from software packages
  Model parameters: μ = mean; σ² = variance

Normal distribution
- Characteristics
  a) f(x) is symmetric about μ
  b) P{μ - σ ≤ X ≤ μ + σ} ≈ .68
  c) P{μ - 2σ ≤ X ≤ μ + 2σ} ≈ .95
- Why is the normal distribution so popular?
  a) If X is distributed as ("~") Normal with parameters (μ, σ), then (X - μ)/σ = "Z" ~ Normal(μ = 0, σ = 1)
  b) Central limit theorem: distributions of sample means converge to normal as n → ∞

Normal distribution - Application
- Question: Is the normal distribution or the exponential distribution a good model for ultrasound measurements in older women?
  If so, then comparisons between cases and controls reduce to comparisons of means and variances
- Method
  Each model predicts the distribution of measurements
  The ECDF F_n characterizes the distribution in our sample
  Compare F_n to
    the Normal CDF with mean = 1761.43, SD = 120.31
    the Exponential CDF with rate = 1/1761.43

Aside
- When is the proposed method a good idea?
- We need F_n to well approximate F; this holds if the sample is representative of a population distributed as F
- Glivenko-Cantelli theorem: Let X_1, ..., X_n be a sequence of random variables obtained through simple random sampling from a population distributed as F. Then P(lim_{n→∞} sup_x |F_n(x) − F(x)| = 0) = 1.
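A minimal Python sketch of the comparison just described, assuming numpy, scipy, and matplotlib are available: it overlays the sample ECDF on the Normal CDF (mean 1761.43, SD 120.31) and the Exponential CDF (rate 1/1761.43) quoted on the slide. The `us_scores` array shows only a few of the 42 ultrasound values and would need to be completed from the data tables above.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# A few of the 42 ultrasound scores, for illustration; fill in the rest from the data tables
us_scores = np.array([1606.0, 1650.25, 1659.75, 1662.0, 1588.66, 1596.83])

x = np.sort(us_scores)
n = len(x)
ecdf = np.arange(1, n + 1) / n                                # F_n evaluated at the sorted data

grid = np.linspace(x.min() - 200, x.max() + 200, 400)
normal_cdf = stats.norm.cdf(grid, loc=1761.43, scale=120.31)  # Normal(mean, SD) from the slide
expon_cdf = stats.expon.cdf(grid, scale=1761.43)              # Exponential with rate = 1/1761.43

plt.step(x, ecdf, where="post", label="ECDF")
plt.plot(grid, normal_cdf, label="Normal CDF")
plt.plot(grid, expon_cdf, label="Exponential CDF")
plt.xlabel("ultrasound score")
plt.ylabel("P(US <= score)")
plt.legend()
plt.show()
```

Overlaying the ECDF on each candidate CDF is the graphical counterpart of the Glivenko-Cantelli argument above: for a representative sample, F_n should track the population F closely, so the better-fitting model is the one whose CDF stays near the ECDF.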
Application - Two models
[Figures (two slides): "Normal" and "Exponential" model CDFs, P(US ≤ score) versus ultrasound score]

Main points
- The goal of biostatistics is to learn from biomedical data involving appreciable variability or uncertainty
- We do this by inferring features of populations from representative samples of them
- Probability is a tool for characterizing populations, samples, and the uncertainty of our inferences from samples to populations
  Definitions
  Random variables
  Distributions
  Parameters: location, spread, other

Main points
- Describing sample distributions is a key step in making inferences about populations
  If the sample "is" the population: the only step
  ECDF, summary statistics, data displays
- Models are lenses to focus questions for statistical analysis
  Parametric distributions
  Normal distribution