
Introduction to Basic Statistics
Statistical Inference and Diagnostics
Hongxiao Zhu and Emily L. Kang
SAMSI
May 16, 2011
Basic Concepts in Probability and Statistics
Random variables
Distribution functions, density functions
Some special distributions
A random variable is a function which depends on the
outcome of some experiment: For instance, the number of
heads in three tosses of a coin
In general, the particular value a random variable will take
cannot be known for certain until the experiment is performed
However, the outcomes of the experiment may exhibit some
statistical regularity: For instance, if a fair coin is tossed many
times, we expect to see heads about half the time
The probability P of a given outcome is the proportion of
times that outcome is observed in a large number of trials of
the experiment
The sample space Ω is the space of outcomes of an experiment.
For example:
Four tosses of a coin
A random draw of a number from the unit interval
A random choice of a function from the space of continuous
functions on [0, 1]
A random variable X is a function from the sample space Ω into
Euclidean space Rⁿ; X : Ω → Rⁿ
The domain of a random variable is the SAMPLE SPACE. Hence
X is only “random” because its domain consists of experimental
outcomes that cannot be determined in advance. Once the
outcome ω ∈ Ω is known, however, X (ω) is not random
Example: Let X = number of heads in two tosses of a coin
The sample space Ω = {HH, HT , TH, TT }
Note that X (HH) = 2, X (HT ) = 1, X (TH) = 1, X (TT ) = 0
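As a small illustrative sketch (not part of the original slides), this mapping ω ↦ X(ω) can be written out directly in R:

# Sample space for two tosses of a coin
omega <- c("HH", "HT", "TH", "TT")
# X maps each outcome to its number of heads
X <- sapply(strsplit(omega, ""), function(toss) sum(toss == "H"))
names(X) <- omega
X   # HH = 2, HT = 1, TH = 1, TT = 0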
Let X denote a random variable and x a particular value that X
might assume
We define the cumulative distribution function of X , F (x), as
F (x) = P(X ≤ x)
If X can only assume discretely many values, we define the
probability mass function p(x) as
p(x) = P(X = x)
Many random variables have a continuum of possible values. In
such cases, there may be a function f (t) so that
P(X ≤ x) = ∫_{−∞}^{x} f (t) dt
and we refer to f as the probability density function
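As a sketch of this relationship, take f to be the standard normal density; numerically integrating it recovers the CDF (dnorm and pnorm are R's built-in density and CDF):

# P(X <= 1) two ways: integrating the density, and the built-in CDF
integrate(dnorm, lower = -Inf, upper = 1)$value   # 0.8413448 (numerical)
pnorm(1)                                          # 0.8413447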
A binomial random variable X with n trials and success probability
p takes values in the set {0, 1, 2, . . . , n} and has probability mass
function
P(X = k) = b(k; n, p) = (n choose k) p^k (1 − p)^{n−k}
A Poisson random variable X with parameter λ takes values in the
nonnegative integers and has probability mass function
P(X = k) = p(k; λ) = e^{−λ} λ^k / k!
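A quick sketch verifying both formulas in R, with illustrative values n = 10, p = 0.3, λ = 2, k = 4:

n <- 10; p <- 0.3; k <- 4; lambda <- 2
# Binomial pmf by the formula, and R's built-in dbinom
choose(n, k) * p^k * (1 - p)^(n - k)   # 0.2001209
dbinom(k, size = n, prob = p)          # matches
# Poisson pmf by the formula, and R's built-in dpois
exp(-lambda) * lambda^k / factorial(k) # 0.09022352
dpois(k, lambda)                       # matches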
A continuous random variable is called Gaussian or normal with
parameters µ and σ if it has a density given by
f (x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}
The expected value of X is µ, and the standard deviation
√Var(X ) is σ
If X1 , . . . , Xn are i.i.d. N(µ, σ), then any linear combination
∑_{i=1}^{n} ai Xi is also normal
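A simulation sketch of the last fact (not from the slides): if X1 and X2 are i.i.d. N(0, 1), then 2X1 + 3X2 should be normal with mean 0 and standard deviation √(2² + 3²) = √13:

set.seed(1)
x1 <- rnorm(1e5)
x2 <- rnorm(1e5)
w <- 2 * x1 + 3 * x2
c(mean(w), sd(w))   # close to 0 and sqrt(13) = 3.606
qqnorm(w)           # points lie close to a straight line, as a normal sample should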
Modeling and Inference
[Diagram] Population (Truth) ⟷ Math/Stat Models ⟷ Sample (Data)
Inference: interpretation, estimating model parameters,
uncertainty quantification, hypothesis testing
Data generating process: probability driven, scientifically motivated
Statistical Analysis
Statistical analysis embraces
Exploratory data analysis (EDA)
Modeling
Parameter estimation
Diagnostics
Hypothesis testing, Uncertainty quantification
In practice, we generally begin with exploratory data analysis
(EDA)
Numerical summaries of the data?
Graphical summaries of the data?
Review:
Boxplot, histogram, scatterplot
Five-number summary and boxplot
Boxplot (figure): largest = max = 6.1; Q3 = third quartile = 4.35;
M = median = 3.4; Q1 = first quartile = 2.2; smallest = min = 0.6
Five-number summary: min, Q1, M, Q3, max
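In R, fivenum() and boxplot() reproduce this summary for any data vector; a sketch with hypothetical data:

x <- c(0.6, 2.2, 3.4, 4.35, 6.1)   # hypothetical data vector
fivenum(x)   # min, Q1 (lower hinge), median, Q3 (upper hinge), max
boxplot(x)   # draws the corresponding boxplot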
Histograms
The range of values that a variable can take is divided into
equal-size intervals.
The histogram shows the
number of individual data
points that fall in each
interval.
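A sketch in R, using simulated right-skewed data:

set.seed(2)
x <- rexp(200)         # simulated data
hist(x, breaks = 10)   # counts of observations falling in equal-width intervals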
Most common distribution shapes
Symmetric distribution: a distribution is symmetric if the right and
left sides of the histogram are approximately mirror images of each
other.
Skewed distribution: a distribution is skewed to the right if the
right side of the histogram (the side with larger values) extends much
farther out than the left side. It is skewed to the left if the left
side of the histogram extends much farther out than the right side.
Complex, multimodal distribution: not all distributions have a simple
overall shape, especially when there are few observations.
What is a model?
Math/Stat Model
Mathematical descriptions of a simplified system
Tools for quantifying the evidence in data about a particular
truth/relationship
Some models are explanatory, while some are descriptive
There are usually certain assumptions associated with models
There is no true model!
But the model can be useful.
Regression Model
Population Linear Regression
The population regression model:
y = β0 + β1 x + ε
Dependent variable: y. Independent variable: x.
Population Y-intercept: β0 . Population slope coefficient: β1 .
Linear component: β0 + β1 x. Random error component: the random
error term, or residual, ε.
Regression Model
Regression models are often explanatory and are used for
Assessing the effect of the independent variables on the
dependent variable, or the relationship between them
Prediction
Assumptions:
Errors are independent
Errors are normally distributed
Errors have common mean 0 and common variance σ²
Estimation of Parameters
For regression models, we already know:
Ordinary least squares estimators minimize the sum of the
squared residuals
After parameter estimation, we need to
Check whether the assumptions of the model are satisfied
(called diagnostics of the model). If not, we need to modify
the model
Make further inferences: prediction, hypothesis testing
Method of least squares to estimate β0 and β1
Recall the garbage-population example: Choose β̂0 and β̂1
to minimize
S(β0 , β1 ) = ∑_{i=1}^{5} (yi − β0 − β1 xi )² .
[Scatterplot: Garbage (vertical axis, roughly 100–200) versus
Population (horizontal axis, 180–280) for the five data points.]
How to solve this minimization problem?
Differentiate S with respect to β0 and β1 and set
∂S/∂β0 = 0,  ∂S/∂β1 = 0.
That is,
∑_{i=1}^{5} (yi − β0 − β1 xi ) = 0,  ∑_{i=1}^{5} xi (yi − β0 − β1 xi ) = 0.
We call the above equations the normal equations.
“Solve the minimization problem” ⇔ “Solve the normal equations”!
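A sketch of both routes in R. The deck's five garbage–population data points are not reproduced in this text, so the (x, y) values below are hypothetical:

# Hypothetical data standing in for the population (x) and garbage (y) example
x <- c(180, 205, 230, 255, 280)
y <- c(105, 130, 150, 160, 195)
# Solve the normal equations directly; in matrix form they read (X'X) beta = X'y
X <- cbind(1, x)
solve(t(X) %*% X, t(X) %*% y)   # beta0-hat and beta1-hat
# The same estimates from R's built-in least-squares fit
coef(lm(y ~ x))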
Statistical Inference
Concepts
Confidence Interval
Hypothesis Testing
P-value
Diagnostics in Regression
Concepts
Statistical Inference means drawing conclusions about the
population based on samples drawn from that population.
Two types of inference problems:
Estimating an unknown population parameter.
Testing a hypothesis about an unknown population parameter.
Diagnostics: Check model assumptions. How do we know
our model is correct? We should check this.
Concepts
Example (simple linear regression): if you have noisy (x, y )
data that you think follow the pattern
y = β0 + β1 x + error ,
then you might want to
estimate β0 , β1 and measure the accuracy of these
estimates.
estimate the magnitude of the error.
predict the value of y when given a new x.
check whether your assumptions about the model hold.
Confidence Interval Estimation
Assume that you have constructed a statistical model.
You have learned how to obtain the parameter estimate of θ in
Part I.
Denote the point estimate of θ by θ̂. A function of the data
is called a statistic, so θ̂ is a statistic.
How accurate is this estimate?
We will construct a confidence interval [L, U], such that
P{L < θ < U} ≥ 1 − α,
for a small α (usually .05 or .10). We call [L, U] a
100(1 − α)% confidence interval (CI) of θ. We need to estimate
L and U.
Confidence Interval Estimation
How do we estimate L and U?
Since [L, U] measures the accuracy of θ̂, we need to look at θ̂:
How close is θ̂ to θ? What is the probability distribution
of θ̂?
The distribution of θ̂, or its limiting distribution (as
n → ∞), plays the key role.
Confidence Interval Estimation
Example: Two-sided CI for normally distributed data.
Consider a random sample X1 , X2 , . . . , Xn from the N(µ, 1)
distribution. Assume µ is unknown. A natural estimator of µ
is X̄ = (1/n) ∑_{i=1}^{n} Xi . X̄ is a function of the sample
(data), so it is a statistic. We find that the distribution of
this statistic is
X̄ ∼ N(µ, 1/n) =⇒ Z = (X̄ − µ)/(1/√n) ∼ N(0, 1).
We know that
P{−zα/2 ≤ Z = (X̄ − µ)/(1/√n) ≤ zα/2 } = 1 − α,
where zα/2 is the upper α/2 quantile of N(0, 1).
Solving the inequalities inside P{·} for µ, we get
L = X̄ − zα/2 /√n ,  U = X̄ + zα/2 /√n .
We claim that with 100(1 − α)% probability, the true µ falls in
this interval.
This describes the accuracy of our estimation. The narrower
the CI, the more accuracy we have.
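A sketch of this interval in R, with simulated data (true µ = 5, n = 10, α = 0.05):

set.seed(3)
n <- 10
x <- rnorm(n, mean = 5, sd = 1)   # simulated N(5, 1) sample
z <- qnorm(1 - 0.05 / 2)          # z_{alpha/2} = 1.959964
c(L = mean(x) - z / sqrt(n),      # 95% CI for mu
  U = mean(x) + z / sqrt(n))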
A CI can be one-sided or two-sided.
The statistic used can have a distribution other than normal.
In the simple regression example, a confidence interval for β1 can be
obtained using the statistic
T = (β̂1 − β1 ) / √( V̂ar(β̂1 ) ) ∼ tn−2 .
For any 0 < α < 1, let tn−2;1−α/2 be such that
P(T < tn−2;1−α/2 ) = 1 − α/2. Therefore,
P(−tn−2;1−α/2 ≤ T ≤ tn−2;1−α/2 ) = 1 − α.
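In R, confint() computes this t-based interval from a fitted model; a sketch reusing the hypothetical data from the least-squares example:

x <- c(180, 205, 230, 255, 280)   # hypothetical data, as before
y <- c(105, 130, 150, 160, 195)
fit <- lm(y ~ x)
confint(fit, level = 0.95)   # row "x" is the 95% CI for beta1, based on t_{n-2}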
Hypothesis Testing
Hypothesis testing is used to make decisions.
We use it to assess whether the sample data support a claim
about the population.
For example, we may want to test whether θ = θ0 , or θ > 0.
A test involves a claim and a counterclaim, written, for example, as
H0 : θ = θ0
H1 : θ ≠ θ0 .
We designate the claim to be proved as H1 , the alternative
hypothesis. The competing claim is designated as H0 , the
null hypothesis.
Hypothesis Testing
Begin with the assumption that H0 is true. If the data
contradict H0 with strong evidence, then H0 is rejected and we
favor H1 . If the data fail to contradict H0 , then H0 is not
rejected.
A test statistic calculated from data is used to make this
decision.
The values of the test statistic for which the test rejects H0
form the rejection region; the complement of the rejection region
is called the acceptance region. The boundaries are defined by
one or more critical constants.
Hypothesis Testing
Control two types of errors:
Type I error: rejecting H0 when it is true.
Type II error: failing to reject H0 when H1 is true.
We control the type I error probability to be no larger than α.
The α is called the significance level.
Hypothesis Testing
Example: Consider a random sample X1 , X2 , . . . , Xn from the
N(µ, 1) distribution. Assume µ is unknown. We have used
X̄ = (1/n) ∑_{i=1}^{n} Xi to estimate µ and computed the
confidence interval. Now I want to test
H0 : µ = 5
H1 : µ ≠ 5.
Use the test statistic Z = (X̄ − 5)/(1/√n).
Find zα/2 such that
P{Type I Error} = P{|Z | > zα/2 | H0 is true} ≤ α.
Reject H0 if |Z | > zα/2 .
We call this an α-level test.
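A sketch of this α-level test in R (α = 0.05, simulated data whose true mean is not 5):

set.seed(4)
n <- 10
x <- rnorm(n, mean = 6, sd = 1)      # simulated sample
z <- (mean(x) - 5) / (1 / sqrt(n))   # test statistic under H0: mu = 5
abs(z) > qnorm(1 - 0.05 / 2)         # TRUE means reject H0 at level 0.05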
P-values
P-value: the smallest α−level at which the observed test
result is significant.
P-value = P{observe a test statistic more extreme than the one
observed | H0 }
The smaller the P-value, the stronger the evidence against H0
provided by the data.
Compare P-value with the significance level α.
If P-value < α, reject H0 .
P-values
Example: Continuing the previous sample-mean example, assume we
computed X̄ = 6, n = 10. Test H0 : µ = 5 vs. H1 : µ ≠ 5.
P-value = P{|Z | = |(X̄ − 5)/(1/√10)| ≥ (6 − 5)/(1/√10)}
= P{|Z | ≥ 3.162}
= 0.00156
Choose α = 0.05; then P-value < 0.05, and H0 is rejected at
α = 0.05.
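This arithmetic is easy to check directly (a sketch):

z <- (6 - 5) / (1 / sqrt(10))   # 3.162278
2 * (1 - pnorm(z))              # two-sided P-value, approximately 0.0016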
Interpret the R output
Model: Price = β0 + β1 × Mile + Error.
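The output referred to here is a screenshot that does not survive in this text. A sketch of how such output is produced, with simulated Price and Mile data (hypothetical values and units):

set.seed(5)
Mile  <- runif(30, 10, 120)                      # simulated mileages
Price <- 20 - 0.1 * Mile + rnorm(30, sd = 1.5)   # simulated prices
fit <- lm(Price ~ Mile)
summary(fit)   # coefficient estimates, standard errors, t values, and p-values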
Diagnostics in Regression
Fitted values: ŷ = β̂0 + β̂1 · x.
Residuals: ei = yi − ŷi .
Predictions: y ∗ = β̂0 + β̂1 · x ∗ .
Diagnostics in Regression
Check Model Assumptions:
Residual Plots: If the model is correct, then ei vs. xi should
be randomly scattered around zero.
In the example plot (not shown here), the residuals form a
parabola, indicating that the model fits inadequately.
Diagnostics in Regression
Check for constant variance: Var(ei ) ≈ σ 2 . If the variance
is constant, the dispersion of the ei ’s is approximately
constant.
Diagnostics in Regression
Check for normality: Normal plot of ei .
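A sketch of these checks in R, continuing from the simulated Price–Mile fit above:

fit <- lm(Price ~ Mile)                  # the fit from the previous sketch
plot(fitted(fit), resid(fit),            # residuals vs. fitted values
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # under a correct model, points scatter randomly about 0
qqnorm(resid(fit)); qqline(resid(fit))   # near-straight line suggests normal errors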
Diagnostics in Regression
Check for outliers
References
Statistical Models in S, John M. Chambers and Trevor J.
Hastie.
Applied Linear Statistical Models, Michael H. Kutner,
Christopher J. Nachtsheim, John Neter and William Li.
Statistics and Data Analysis from Elementary to Intermediate,
Ajit C. Tamhane and Dorothy D. Dunlop.