Introduction to Basic Statistics: Statistical Inference and Diagnostics
Hongxiao Zhu and Emily L. Kang
SAMSI, May 16, 2011

Basic Concepts in Probability and Statistics
- Random variables
- Distribution functions, density functions
- Some special distributions

Random Variables
A random variable is a function that depends on the outcome of some experiment: for instance, the number of heads in three tosses of a coin. In general, the particular value a random variable will take cannot be known for certain until the experiment is performed. However, the outcomes of the experiment may exhibit some statistical regularity: for instance, if a fair coin is tossed many times, we expect to see heads about half the time. The probability P of a given outcome is the proportion of times that outcome is observed in a large number of trials of the experiment.

The Sample Space
The sample space Ω is the space of outcomes of an experiment. For example:
- four tosses of a coin
- a random draw of a number from the unit interval
- a random choice of a function from the space of continuous functions on [0, 1]

A random variable X is a function from the sample space Ω into Euclidean space R^n: X : Ω → R^n. The domain of a random variable is the SAMPLE SPACE. Hence X is only "random" because its domain consists of experimental outcomes that cannot be determined in advance. Once the outcome ω ∈ Ω is known, however, X(ω) is not random.
Example: let X = number of heads in two tosses of a coin. The sample space is Ω = {HH, HT, TH, TT}, and X(HH) = 2, X(HT) = 1, X(TH) = 1, X(TT) = 0.

Distribution Functions
Let X denote a random variable and x a particular value that X might assume. We define the cumulative distribution function of X as
  F(x) = P(X ≤ x).
If X can only assume discretely many values, we define the probability mass function as
  p(x) = P(X = x).
Many random variables have a continuum of possible values. In such cases, there may be a function f(t) so that
  P(X ≤ x) = ∫_{-∞}^{x} f(t) dt,
and we refer to f as the probability density function.

Some Special Distributions
A binomial random variable X with n trials and success probability p takes values in the set {0, 1, 2, ..., n} and has probability mass function
  P(X = k) = b(k; n, p) = (n choose k) p^k (1 - p)^{n-k}.
A Poisson random variable X with parameter λ takes values in the nonnegative integers and has probability mass function
  P(X = k) = p(k; λ) = e^{-λ} λ^k / k!.
A continuous random variable is called Gaussian or normal with parameters µ and σ if it has density
  f(x) = (1 / (σ √(2π))) e^{-(x - µ)² / (2σ²)}.
The expected value of X is µ, and the standard deviation √Var(X) is σ. If X_1, ..., X_n are i.i.d. N(µ, σ), then any linear combination Σ_{i=1}^n a_i X_i is also normal.

Modeling and Inference
[Diagram: a model (truth) and data are connected in both directions. The data generating process, which is probability driven, takes the model to the data; inference, which is scientifically motivated, takes the data back to the model through estimating model parameters, uncertainty quantification, and hypothesis testing, and ends in interpretation.]

Statistical Analysis
Statistical analysis embraces:
- exploratory data analysis (EDA)
- modeling
- parameter estimation
- diagnostics
- hypothesis testing, uncertainty quantification

In practice, we generally begin with exploratory data analysis (EDA):
- Numerical summaries of the data?
- Graphical summaries of the data?
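To make these two kinds of summaries concrete, here is a minimal EDA sketch in R (the language whose output the later slides interpret). The data are simulated for illustration only; none of these numbers come from the slides.

    ## Minimal EDA sketch: numerical and graphical summaries of a
    ## simulated sample (illustrative data only).
    set.seed(1)
    x <- rnorm(100, mean = 3, sd = 1.5)    # simulated measurements

    ## Numerical summaries
    summary(x)     # Min, 1st Qu., Median, Mean, 3rd Qu., Max
    fivenum(x)     # Tukey's five numbers: min, lower hinge, median, upper hinge, max
    sd(x)          # sample standard deviation

    ## Graphical summaries
    hist(x, main = "Histogram of x")       # distribution shape
    boxplot(x, main = "Boxplot of x")      # five-number summary, outliers
    y <- 2 + 0.5 * x + rnorm(100, sd = 0.5)
    plot(x, y, main = "Scatterplot")       # relationship between two variables

These are exactly the displays reviewed next: the boxplot, the histogram, and the scatterplot.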
Review: Boxplot, Histogram, Scatterplot

Five-Number Summary and Boxplot
The five-number summary is: min, Q1 (first quartile), M (median), Q3 (third quartile), max. A boxplot displays these five numbers graphically.
[Example boxplot from the slide: largest = max = 6.1, Q3 = third quartile = 4.35, M = median = 3.4, Q1 = first quartile = 2.2, smallest = min = 0.6.]

Histograms
The range of values that a variable can take is divided into equal-size intervals. The histogram shows the number of individual data points that fall in each interval.
[Example histogram from the slide: states binned by the Hispanic percentage of their population; the first column represents all states whose percentage falls in the lowest interval.]

Most Common Distribution Shapes
- Symmetric distribution: a distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
- Skewed distribution: a distribution is skewed to the right if the right side of the histogram (the side with the larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.
- Complex, multimodal distribution: not all distributions have a simple overall shape, especially when there are few observations.

What Is a Model?

Math/Stat Model
- Mathematical descriptions of a simplified system
- Tools for quantifying the evidence in data about a particular truth or relationship
- Some models are explanatory, while some are descriptive
- There are usually certain assumptions associated with models
- There is no true model! But the model can be useful.

Regression Model
[Diagram: the population linear regression model, Y_i = β_0 + β_1 X_i + ε_i, with its parts labeled: dependent variable Y_i; population y-intercept β_0; population slope coefficient β_1; independent variable X_i; and random error term, or residual, ε_i. Here β_0 + β_1 X_i is the linear component and ε_i is the random error component.]

Regression models are often explanatory and are used for
- assessing the effect of, or the relationship between, the independent variables and the dependent variable
- prediction
Assumptions:
- Errors are independent.
- Errors are normally distributed.
- Errors have common mean 0 and common variance σ².

Estimation of Parameters
For regression models, we already know the ordinary least squares estimators, which minimize the sum of the squared residuals. After parameter estimation, we need to
- check whether the assumptions of the model are satisfied (called diagnostics of the model); if not, we need to modify the model
- make further inferences: prediction, hypothesis tests

Method of Least Squares to Estimate β_0 and β_1
Recall the garbage-population example: choose β̂_0 and β̂_1 to minimize
  S(β_0, β_1) = Σ_{i=1}^{5} (y_i - β_0 - β_1 x_i)².
[Scatterplot from the slide: Garbage (vertical axis, roughly 100 to 200) against Population (horizontal axis, roughly 180 to 280) for the five observations.]

How do we solve this minimization problem? Differentiate S with respect to β_0 and β_1 and set
  ∂S/∂β_0 = 0,  ∂S/∂β_1 = 0.
That is,
  Σ_{i=1}^{5} (y_i - β_0 - β_1 x_i) = 0,  Σ_{i=1}^{5} x_i (y_i - β_0 - β_1 x_i) = 0.
We call these the normal equations: "solve the minimization problem" ⇔ "solve the normal equations". A small numerical sketch follows.
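The sketch below solves the normal equations in R. The slide does not give the five (population, garbage) data points numerically, so the values here are hypothetical, chosen only to match the plot's rough ranges.

    ## Solving the normal equations for a garbage-population style example.
    ## Hypothetical data; the slide's actual five points are not given.
    x <- c(185, 210, 235, 255, 280)   # population (hypothetical)
    y <- c(105, 130, 150, 160, 195)   # garbage (hypothetical)

    ## Closed-form solution of the two normal equations:
    ##   sum(y - b0 - b1*x) = 0  and  sum(x * (y - b0 - b1*x)) = 0
    b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    b0 <- mean(y) - b1 * mean(x)
    c(b0 = b0, b1 = b1)

    ## The same estimates via R's built-in least squares fit
    coef(lm(y ~ x))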
Statistical Inference
- Concepts
- Confidence intervals
- Hypothesis testing
- P-values
- Diagnostics in regression

Concepts
Statistical inference means drawing conclusions about the population based on samples drawn from that population. There are two types of inference problems:
- estimating an unknown population parameter
- testing a hypothesis about an unknown population parameter
Diagnostics means checking the model assumptions: how do we know our model is correct? We should check this.

Example: simple linear regression. If you have noisy (x, y) data that you think follow the pattern y = β_0 + β_1 x + error, then you might want to
- estimate β_0 and β_1, and measure the accuracy of these estimates
- estimate the magnitude of the error
- predict the value of y given a new x
- check whether your assumptions about the model hold

Confidence Interval Estimation
Assume that you have constructed a statistical model; you have learned how to get the parameter estimate in Part I. Denote the point estimate of θ by θ̂. A function of the data is called a statistic, so θ̂ is a statistic. How accurate is this estimate? We will construct a confidence interval [L, U] such that
  P{L < θ < U} ≥ 1 - α
for a small α (usually .05 or .10). We call [L, U] a 100(1 - α)% confidence interval (CI) for θ. We need to estimate L and U.

How do we estimate L and U? Since [L, U] measures the accuracy of θ̂, we need to look at θ̂: how close is θ̂ to θ, and what is the probability distribution of θ̂? The distribution of θ̂, or its limiting distribution (as n → ∞), plays the key role.

Example: two-sided CI for normally distributed data. Consider a random sample X_1, X_2, ..., X_n from the N(µ, 1) distribution, where µ is unknown. A natural estimator of µ is X̄ = (1/n) Σ_{i=1}^n X_i. X̄ is a function of the samples (the data), so it is a statistic. The distribution of this statistic is
  X̄ ~ N(µ, 1/n)  ⇒  Z = (X̄ - µ) / (1/√n) ~ N(0, 1).
We know that
  P{-z_{α/2} ≤ (X̄ - µ)/(1/√n) ≤ z_{α/2}} = 1 - α,
where z_{α/2} is the upper α/2 quantile of N(0, 1). Solving the inequalities inside P{·} for µ, we get
  L = X̄ - z_{α/2}/√n,  U = X̄ + z_{α/2}/√n.
We claim that with 100(1 - α)% probability the true µ falls in this interval. This describes the accuracy of our estimate: the narrower the CI, the more accurate the estimate.

A CI can be one-sided or two-sided, and the statistic used can have a distribution other than normal. In the simple regression example, a confidence interval for β_1 can be obtained using the statistic
  T = (β̂_1 - β_1) / √(Var̂(β̂_1)) ~ t_{n-2},
where Var̂(β̂_1) is the estimated variance of β̂_1. For any 0 < α < 1, let t_{n-2; 1-α/2} be such that P(T < t_{n-2; 1-α/2}) = 1 - α/2. Therefore
  P(-t_{n-2; 1-α/2} ≤ T ≤ t_{n-2; 1-α/2}) = 1 - α.

Hypothesis Testing
A hypothesis test is used to make decisions: we use it to assess whether the sample data support a claim about the population. For example, we may want to test whether θ = θ_0, or θ > 0. A test involves a claim and a counterclaim; write, for example,
  H_0 : θ = θ_0  vs.  H_1 : θ ≠ θ_0.
We designate the claim to be proved as H_1, the alternative hypothesis. The competing claim is designated as H_0, the null hypothesis.

Begin with the assumption that H_0 is true. If the data contradict H_0 with strong evidence, then H_0 is rejected and we favor H_1. If the data fail to contradict H_0, then H_0 is not rejected. A test statistic calculated from the data is used to make this decision. The values of the test statistic for which the test rejects H_0 form the rejection region; the complement of the rejection region is called the acceptance region. The boundaries are defined by one or more critical constants.

We control two types of errors:
- Type I error: rejecting H_0 when it is true.
- Type II error: failing to reject H_0 when H_1 is true.
We control the type I error probability to be no larger than α; this α is called the significance level.
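Before turning to a worked test, here is a minimal R sketch of the two-sided CI derived above, on a simulated sample (σ = 1 known, µ unknown; the numbers are not from the slides):

    ## Two-sided 100(1 - alpha)% CI for mu from N(mu, 1) data,
    ## following the derivation above (simulated data).
    set.seed(2)
    n <- 10
    x <- rnorm(n, mean = 5, sd = 1)   # sd known to be 1, mean unknown

    alpha <- 0.05
    xbar  <- mean(x)
    z     <- qnorm(1 - alpha / 2)     # upper alpha/2 quantile of N(0, 1)

    c(L = xbar - z / sqrt(n), U = xbar + z / sqrt(n))   # 95% CI for mu

For the regression slope, confint() applied to an lm() fit returns the analogous t-based interval for β_1.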
Example: Consider a random sample X_1, X_2, ..., X_n from the N(µ, 1) distribution, where µ is unknown. We have used X̄ = (1/n) Σ_{i=1}^n X_i to estimate µ and computed the confidence interval. Now we want to test
  H_0 : µ = 5  vs.  H_1 : µ ≠ 5.
Use the test statistic
  Z = (X̄ - 5) / (1/√n).
Find z_{α/2} such that
  P{Type I error} = P{|Z| > z_{α/2} | H_0 is true} ≤ α,
and reject H_0 if |Z| > z_{α/2}. We call this an α-level test.

P-values
The P-value is the smallest α-level at which the observed test result is significant:
  P{observing a test statistic more extreme than the one observed | H_0}.
The smaller the P-value, the stronger the evidence against H_0 provided by the data. Compare the P-value with the significance level α: if the P-value is less than α, reject H_0.

Example: Following the sample-mean example above, assume we computed X̄ = 6 with n = 10, and test H_0 : µ = 5 vs. H_1 : µ ≠ 5. Then
  P-value = P{|Z| ≥ |(6 - 5)/(1/√10)|} = P{|Z| ≥ 3.162} = 0.00156.
Choosing α = 0.05, the P-value < 0.05, so H_0 is rejected at level α = 0.05.

Interpret the R Output
Model: Price = β_0 + β_1 × Mile + Error.
[The slide shows the R output for this fitted model.]

Diagnostics in Regression
- Fitted values: ŷ = β̂_0 + β̂_1 · x
- Residuals: e_i = y_i - ŷ_i
- Predictions: y* = β̂_0 + β̂_1 · x*

Check model assumptions with residual plots: if the model is correct, then e_i versus x_i should be randomly scattered around zero. [In the slide's example the plot is parabolic, indicating an inadequate fit of the model.]
Check for constant variance: Var(e_i) ≈ σ². If the variance is constant, the dispersion of the e_i's is approximately constant.
Check for normality: normal (Q-Q) plot of the e_i.
Check for outliers.
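To close, here is a sketch of these diagnostic checks in R. The Price/Mile data behind the slides' output are not given, so the data below are simulated for illustration:

    ## Regression diagnostics sketch on simulated data
    ## (the slides' Price ~ Mile data are not available).
    set.seed(3)
    x <- runif(50, 0, 10)
    y <- 2 + 0.5 * x + rnorm(50, sd = 1)
    fit <- lm(y ~ x)

    summary(fit)               # coefficient table, as in the R output slide
    e    <- resid(fit)         # residuals e_i = y_i - yhat_i
    yhat <- fitted(fit)        # fitted values yhat_i

    plot(x, e); abline(h = 0)  # should be random scatter around zero
    plot(yhat, abs(e))         # roughly flat if the variance is constant
    qqnorm(e); qqline(e)       # close to the line if errors are normal
    predict(fit, newdata = data.frame(x = 5))   # prediction at a new x* = 5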