Chapter 6
Statistical Inference
From now on, we will use probability theory only to find answers to the questions arising from specific
problems we are working on.
In this chapter we want to draw inferences about some characteristic of an underlying population - e.g. the
average height of a person. Instead of measuring this characteristic of each individual, we will draw a sample,
i.e. choose a “suitable” subset of the population and measure the characteristic only for those individuals.
Using some probabilistic arguments we can then extend the information we got from that sample and make
an estimate of the characteristic for the whole population. Probability theory will give us the means to find
those estimates and to measure how “probable” our estimates are.
Of course, choosing the sample is crucial. We will demand two properties from a sample:
• the sample should be representative - taking only basketball players into the sample would change our
estimate about a person’s height drastically.
• if the sample is large, our estimate should come close to the “true” value of the characteristic.
The three main areas of statistics are
• estimation of parameters:
point or interval estimates: “my best guess for value x is . . . ”, “my guess is that value x is in interval
(a, b)”
• evaluation of plausibility of values: hypothesis testing
• prediction of future (individual) values
6.1 Parameter Estimation
Statistics are all around us - scores in sports, prices at the grocers, weather reports (and how often they
turn out to be close to the actual weather), taxes, evaluations . . .
The most basic form of statistics are descriptive statistics.
But - what exactly is a statistic? - Here is the formal definition:
Definition 6.1.1 (Statistics)
Any function W (x1 , . . . , xk ) of observed values x1 , . . . , xk is called a statistic.
Some statistics you already know are:
Mean (Average): X̄ = (1/n) · Σᵢ Xᵢ
Minimum: X(1) (the parentheses indicate that the values are sorted)
Maximum: X(n)
Range: X(n) − X(1)
Mode: the value(s) that appear(s) most often
Median: the “middle value”, for which one half of the data is larger and the other half is smaller. If n is odd the median is X((n+1)/2); if n is even, the median is the average of the two middle values: 0.5 · X(n/2) + 0.5 · X(n/2+1).

For this section it is important to distinguish between xi and Xi properly. If not stated otherwise, any capital
letter denotes a random variable, and a small letter describes a realization of this random variable, i.e.
what we have observed. xi therefore is a real number, while Xi is a function that assigns a real number to an
event from the sample space.
Definition 6.1.2 (estimator)
Let X1 , . . . , Xk be k i.i.d. random variables with distribution Fθ with (unknown) parameter θ.
A statistic Θ̂ = Θ̂(X1 , . . . , Xk ) used to estimate the value of θ is called an estimator of θ.
θ̂ = Θ̂(x1 , . . . , xk ) is called an estimate of θ.
Desirable properties of estimators:

• Unbiasedness: the expected value of the estimator is the true parameter:

E[Θ̂] = θ

• Efficiency: for two estimators Θ̂1 and Θ̂2 of the same parameter θ, Θ̂1 is said to be more efficient than Θ̂2 if

Var[Θ̂1] < Var[Θ̂2]

• Consistency: if we have a larger sample size n, we want the estimate θ̂ to be closer to the true parameter θ:

lim_{n→∞} P(|Θ̂ − θ| > ε) = 0

(Figures: scatter plots mark the true value x and the values obtained from single samples. An unbiased estimator scatters around x; estimator 1 is better than estimator 2 because its values scatter more tightly around x; and the same estimator scatters more tightly for n = 10000 than for n = 100.)
Example 6.1.1
Let X1 , . . . , Xn be n i.i.d. random variables with E[Xi ] = µ.
Then X̄ = (1/n) Σ_{i=1}^n Xi is an unbiased estimator of µ, because

E[X̄] = (1/n) Σ_{i=1}^n E[Xi] = (1/n) · n · µ = µ.
Once we have an estimator, we can decide whether it has these properties. But how do we find
estimators in the first place?
6.1.1 Maximum Likelihood Estimation
Situation: We have n data values x1 , . . . , xn . The assumption is that these data values are realizations of n
i.i.d. random variables X1 , . . . , Xn with distribution Fθ . Unfortunately, the value of θ is unknown.
(Figure: the observed values x1, x2, x3, . . . on the x-axis, together with three candidate density functions fθ, drawn for θ = 0, θ = −1.8 and θ = 1.)
By changing the value for θ we can “move the density function fθ around” - in the diagram, the third density
function fits the data best.
Principle: since we do not know the true value θ of the distribution, we take that value θ̂ that most likely
produced the observed values, i.e.
maximize something like

P(X1 = x1 ∩ X2 = x2 ∩ . . . ∩ Xn = xn) = [Xi are independent!] = P(X1 = x1) · P(X2 = x2) · . . . · P(Xn = xn) = ∏_{i=1}^n P(Xi = xi)   (*)
This is not quite the right way to write the probability, if X1 , . . . , Xn are continuous variables. (Remember:
P (X = x) = 0 for a continuous variable X; this is still valid)
We use the above “probability” just as a plausibility argument. To get around the problem that P(X = x) = 0 for a continuous variable, we will write (*) as:
∏_{i=1}^n pθ(xi) for discrete Xi   and   ∏_{i=1}^n fθ(xi) for continuous Xi,

where pθ is the probability mass function of discrete Xi (all Xi have the same, since they are identically distributed) and fθ is
the density function of continuous Xi .
Both these functions depend on θ. In fact, we can write the above expressions as a function in θ. This
function, which we will denote by L(θ), is called the Likelihood function of X1 , . . . , Xn .
The goal is now, to find a value θ̂ that maximizes the Likelihood function. (this is what “moves” the density
to the right spot, so it fits the observed values well)
How do we get a maximum of L(θ)? The usual way we maximize a function: differentiate it and
set the derivative to zero! (After that, we ought to check with the second derivative whether we have actually found a
maximum, but we won’t do that unless we have found more than one candidate value for θ̂.)
Most of the time it is cumbersome to differentiate the product L(θ) directly - instead we use another trick and find a maximum
of log L(θ), the Log-Likelihood function.
Note: though its name is “log”, we use the natural logarithm ln.
The plan to find an ML-estimator is:
1. Find Likelihood function L(θ).
2. Get natural log of Likelihood function log L(θ).
3. Differentiate log-Likelihood function with respect to θ.
4. Set derivative to zero.
5. Solve for θ.
Example 6.1.2 Roll a Die
A die is rolled until its face shows a 6.
Repeating this experiment 100 times gave the following results:
(Figure: histogram “# Rolls of a Die until first 6” - number of runs (0 to 20) against the number of rolls k.)

k:        1   2   3   4   5   6   7   8   9   11  14  15  16  17  20  21  27  29
# trials: 18  20  8   9   9   5   8   3   5   3   3   3   1   1   1   1   1   1
We know that k, the number of rolls until a 6 shows up, has a geometric distribution Geo_p . For a fair die, p
is 1/6.
The geometric distribution has probability mass function p(k) = (1 − p)^{k−1} · p.
What is the ML-estimate p̂ for p?
1. Likelihood function L(p):
Since we have observed 100 outcomes k1 , ..., k100 , the likelihood function is L(p) = ∏_{i=1}^{100} p(ki), i.e.

L(p) = ∏_{i=1}^{100} (1 − p)^{ki−1} p = p^{100} · ∏_{i=1}^{100} (1 − p)^{ki−1} = p^{100} · (1 − p)^{Σ_{i=1}^{100} (ki−1)} = p^{100} · (1 − p)^{Σ_{i=1}^{100} ki − 100}

2. Log of Likelihood function log L(p):

log L(p) = log(p^{100} · (1 − p)^{Σ_{i=1}^{100} ki − 100}) = log p^{100} + log((1 − p)^{Σ_{i=1}^{100} ki − 100}) = 100 log p + (Σ_{i=1}^{100} ki − 100) log(1 − p).
3. Differentiate log-Likelihood with respect to p:

d/dp log L(p) = 100 · (1/p) + (Σ_{i=1}^{100} ki − 100) · (−1/(1 − p))
= (1/(p(1 − p))) · (100(1 − p) − (Σ_{i=1}^{100} ki − 100) p)
= (1/(p(1 − p))) · (100 − p Σ_{i=1}^{100} ki).
4. Set derivative to zero.
For the estimate p̂ the derivative must be zero:

d/dp log L(p̂) = 0 ⇐⇒ (1/(p̂(1 − p̂))) · (100 − p̂ Σ_{i=1}^{100} ki) = 0

5. Solve for p̂.

(1/(p̂(1 − p̂))) · (100 − p̂ Σ_{i=1}^{100} ki) = 0 ⇐⇒ 100 − p̂ Σ_{i=1}^{100} ki = 0 ⇐⇒ p̂ = 100 / Σ_{i=1}^{100} ki

In total, we have an estimate p̂ = 100/568 = 0.1761.
Example 6.1.3 Red Cars in the Parking Lot
The values 3,2,3,3,4,1,4,2,4,3 have been observed while counting the numbers of red cars pulling into the
parking lot # 22 between 8:30 - 8:40 am Mo to Fr during two weeks.
The assumption is that these values are realizations of ten independent Poisson variables with (the same)
rate λ.
What is the Maximum Likelihood estimate of λ?
The probability mass function of a Poisson distribution is pλ(x) = e^{−λ} · λ^x / x!.
We have ten values xi ; this gives a Likelihood function:

L(λ) = ∏_{i=1}^{10} e^{−λ} · λ^{xi} / xi! = e^{−10λ} · λ^{Σ_{i=1}^{10} xi} · ∏_{i=1}^{10} 1/xi!
The log-Likelihood then is

log L(λ) = −10λ + ln(λ) · Σ_{i=1}^{10} xi − Σ_{i=1}^{10} ln(xi!).
Differentiating the log-Likelihood with respect to λ gives:

d/dλ log L(λ) = −10 + (1/λ) · Σ_{i=1}^{10} xi

Setting it to zero:

(1/λ̂) · Σ_{i=1}^{10} xi = 10 ⇐⇒ λ̂ = (1/10) Σ_{i=1}^{10} xi = 29/10 = 2.9

This gives us an estimate for λ - and since λ is also the expected value of the Poisson distribution, we can
say that on average the number of red cars pulling into the parking lot each morning between 8:30 and 8:40
am is 2.9.
ML-estimators for µ and σ 2 of a Normal distribution
Let X1 , . . . , Xn be n independent, identically distributed normal variables with E[Xi ] = µ and V ar[Xi ] = σ 2 .
µ and σ 2 are unknown.
The normal density function fµ,σ² is

fµ,σ²(x) = (1/√(2πσ²)) · e^{−(x−µ)²/(2σ²)}

Since we have n independent variables, the Likelihood function is a product of n densities:

L(µ, σ²) = ∏_{i=1}^n (1/√(2πσ²)) · e^{−(xi−µ)²/(2σ²)} = (2πσ²)^{−n/2} · e^{−Σ_{i=1}^n (xi−µ)²/(2σ²)}

Log-Likelihood:

log L(µ, σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (xi − µ)²
Since we now have two parameters, µ and σ², we need two partial derivatives of the log-Likelihood:

d/dµ log L(µ, σ²) = 0 − (1/(2σ²)) · Σ_{i=1}^n 2(xi − µ) · (−1) = (1/σ²) Σ_{i=1}^n (xi − µ)

d/dσ² log L(µ, σ²) = −(n/2) · (1/σ²) + (1/(2(σ²)²)) Σ_{i=1}^n (xi − µ)²

We now must find values for µ and σ² that yield zeros for both derivatives at the same time.
Setting d/dµ log L(µ, σ²) = 0 gives

µ̂ = (1/n) Σ_{i=1}^n xi ,

plugging this value into the derivative for σ² and setting d/dσ² log L(µ̂, σ²) = 0 gives

σ̂² = (1/n) Σ_{i=1}^n (xi − µ̂)²
6.2 Confidence Intervals
The previous section has provided a way to compute point estimates for parameters. Based on that, our
next question is: how good is this point estimate? Or: how close is the estimate to the true value of the
parameter?
Instead of just looking at the point estimate, we will now try to compute an interval around the estimated
parameter value in which the true parameter is “likely” to fall. An interval like that is called a confidence
interval.
Definition 6.2.1 (Confidence Interval)
Let θ̂ be an estimate of θ.
If P(|θ̂ − θ| < e) > α, we say that the interval (θ̂ − e, θ̂ + e) is an α · 100% confidence interval for θ (cf. fig.
6.1).
Usually, α is a value near 1, such as 0.9, 0.95, 0.99, 0.999, etc.
Note:
• for any given set of values x1 , . . . , xn the value of θ̂ is fixed, as is the interval (θ̂ − e, θ̂ + e).
• The true value θ is either within the confidence interval or not.
Figure 6.1: The probability that x̄ falls into an e-interval around µ is α. Vice versa, for each of those x̄,
µ is within an e-interval around x̄. That is the idea of a confidence interval.
!!DON’T DO!!
A lot of people are tempted to reformulate the above probability to:
P (θ̂ − e < θ < θ̂ + e) > α
Though it looks ok, it’s not. Repeat: IT IS NOT OK.
θ is a fixed value - therefore, it does not have a probability to fall into some interval.
The only probability that we have, here, is
P (θ − e < θ̂ < θ + e) > α,
we can therefore say that θ̂ has a probability of at least α to fall into an e-interval around θ. Unfortunately,
that doesn’t help at all, since we do not know θ!
How do we compute confidence intervals, then? - that’s different for each estimator.
First, we look at estimates of a mean of a distribution:
6.2.1 Large sample C.I. for µ
Situation: we have a large set of observed values (n > 30, usually).
The assumption is that these values are realizations of n i.i.d. random variables X1 , . . . , Xn with E[Xi] = µ
and Var[Xi] = σ².
We already know from the previous section that X̄ is an unbiased ML-estimator for µ.
But we know more! The CLT tells us that in exactly this situation X̄ is an approximately normally
distributed random variable with E[X̄] = µ and Var[X̄] = σ²/n.
We can therefore find the boundary e by using the standard normal distribution. Remember: if X̄ ∼
N(µ, σ²/n) then Z := (X̄ − µ)/(σ/√n) ∼ N(0, 1) with cdf Φ:
P(|X̄ − µ| ≤ e) ≥ α
⇐⇒ P(|X̄ − µ|/(σ/√n) ≤ e/(σ/√n)) ≥ α    (standardization)
⇐⇒ P(|Z| ≤ e/(σ/√n)) ≥ α
⇐⇒ P(−e/(σ/√n) ≤ Z ≤ e/(σ/√n)) ≥ α
⇐⇒ Φ(e/(σ/√n)) − Φ(−e/(σ/√n)) ≥ α
⇐⇒ Φ(e/(σ/√n)) − (1 − Φ(e/(σ/√n))) ≥ α
⇐⇒ 2Φ(e/(σ/√n)) − 1 ≥ α
⇐⇒ Φ(e/(σ/√n)) ≥ (1 + α)/2
⇐⇒ e/(σ/√n) ≥ Φ⁻¹((1 + α)/2)
⇐⇒ e ≥ Φ⁻¹((1 + α)/2) · σ/√n,   where we set z := Φ⁻¹((1 + α)/2).

This computation gives an α · 100% confidence interval around µ as:

(X̄ − z · σ/√n, X̄ + z · σ/√n)
Now we can do an example:
Example 6.2.1
Suppose, we want to find a 95% confidence interval for the mean salary of an ISU employee.
A random sample of 100 ISU employees gives us a sample mean salary of $21543 = x̄.
Suppose, the standard deviation of salaries is known to be $3000.
By using the above expression, we get a 95% confidence interval as:

21543 ± Φ⁻¹((1 + 0.95)/2) · 3000/√100 = 21543 ± Φ⁻¹(0.975) · 300

How do we read Φ⁻¹(0.975) from the standard normal table? We look for which z we have Φ(z) ≥ 0.975.
This gives us z = 1.96, the 95% confidence interval is then:
21543 ± 588,
i.e. if we repeat this study 100 times (with 100 different employees each time), we can say: in 95 out of 100
studies, the true parameter µ falls into a $588 range around x̄.
Critical values z = Φ⁻¹((1 + α)/2), depending on α, are:

α:  0.90  0.95  0.98  0.99
z:  1.65  1.96  2.33  2.58
Problem: Usually, we do not know σ!
Slight generalization: use s² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)² instead of σ²!
An α · 100% confidence interval for µ is given as

(X̄ − z · s/√n, X̄ + z · s/√n)

where z = Φ⁻¹((1 + α)/2).
Example 6.2.2
Suppose, we want to analyze some complicated queueing system, for which we have no formulas and theory.
We are interested in the mean queue length of the system after reaching steady state.
The only thing possible for us is to run simulations of this system and look at the queue length at some large
time t, e.g. t = 1000 hrs.
After 50 simulations, we have got data:
X1 = number in queue at time 1000 hrs in 1st simulation
X2 = number in queue at time 1000 hrs in 2nd simulation
...
X50 = number in queue at time 1000 hrs in 50th simulation
Our observations yield an average queue length of x̄ = 21.5 and s = √((1/(n−1)) Σ_{i=1}^n (xi − x̄)²) = 15.
A 90% confidence interval is given as

(x̄ − z · s/√n, x̄ + z · s/√n) = (21.5 − 1.65 · 15/√50, 21.5 + 1.65 · 15/√50) = (17.9998, 25.0002)
Example 6.2.3
The graphs show a set of 80 experiments. The values from each experiment are shown in one of the green
framed boxes. Each experiment consists of simulating 20 values from a standard normal distribution (these
are drawn as the small blue lines). For each of the experiments, the average of the 20 values is computed
(that’s x̄), as well as a confidence interval for µ: for parts a) and b) it is the 95% confidence interval, for part
c) it is the 90% confidence interval, for part d) it is the 99% confidence interval. The upper and the lower
confidence bound together with the sample mean are drawn in red next to the sampled observations.

(Figure, four panels: a) 95% confidence intervals, b) 95% confidence intervals, c) 90% confidence intervals, d) 99% confidence intervals.)
There are several things to see from this diagram. First of all, in this example we know the “true” value of
the parameter µ - since the observations are sampled from a standard normal distribution, µ = 0. The true
parameter is represented by the straight horizontal line through 0.
We see that each sample yields a different confidence interval, all of them centered around the sample
mean. The different sizes of the intervals tell us another thing: in computing these confidence intervals, we
had to use the estimate s instead of the true standard deviation σ = 1, and each sample gave a slightly different
standard deviation. Overall, though, the intervals are not very different in length between parts a) and b).
The intervals in c) tend to be slightly smaller - these are 90% confidence intervals - whereas the
intervals in part d) are on average larger than the first ones; they are 99% confidence intervals.
Almost all the confidence intervals contain 0 - but not all. And that is what we expect. For a 90% confidence
interval we expect that in 10 out of 100 times the confidence interval does not contain the true parameter.
When we check that, we see that in part c) 4 out of the 20 confidence intervals don’t contain the true
parameter µ - that’s 20%, while on average we would expect 10% of the confidence intervals not to contain µ.
Official use of Confidence Intervals:
In an average of 90 out of 100 times the 90% confidence interval of θ does contain
the true value of θ.
6.2.2 Large sample confidence intervals for a proportion p
Let p be a proportion of a large population or a probability.
In order to get an estimate for this proportion, we can take a sample of n individuals from the population
and check each one of them, whether or not they fulfill the criterion to be in that proportion of interest.
Mathematically, this corresponds to a Bernoulli-n-sequence, where we are only interested in the number of
“successes”, X, which in our case corresponds to the number of individuals that qualify for the interesting
subgroup.
X then has a Binomial distribution with parameters n and p.
Now think: for a Binomial variable X, the expected value is E[X] = n · p, and X itself estimates it. Therefore we get an estimate p̂ for
p as p̂ = X/n.
Furthermore, we even have a distribution for p̂ for large n: since X is, using the CLT, approximately a normal variable
with E[X] = np and Var[X] = np(1 − p), we get that for large n, p̂ is approximately normally distributed
with E[p̂] = p and Var[p̂] = p(1 − p)/n.
BTW: this tells us that p̂ is an unbiased estimator of p.
Prepared with the distribution of p̂, we can set up an α · 100% confidence interval as

(p̂ − e, p̂ + e)

where e is some positive real number with

P(|p̂ − p| ≤ e) ≥ α.

We can derive the expression for e in the same way as in the previous section, and we come up with:

e = z · √(p(1 − p)/n)

where z = Φ⁻¹((1 + α)/2).
We also run into the problem that e in this form is not ready for use, since we do not know the value of p.
In this situation, we have two options: we can either replace p(1 − p) by a value that maximizes it, or
we can substitute an appropriate estimate for p.
6.2.2.1 Conservative Method
Replace p(1 − p) by something that is guaranteed to be at least as large: the function p(1 − p) has a maximum for p = 0.5, where p(1 − p) = 0.25.
The conservative α · 100% confidence interval for p is

p̂ ± z · 1/(2√n)

where z = Φ⁻¹((1 + α)/2).
6.2.2.2 Substitution Method
Substitute p̂ for p; then the α · 100% confidence interval for p by substitution is

p̂ ± z · √(p̂(1 − p̂)/n)

where z = Φ⁻¹((1 + α)/2).
Where is the difference between the two methods?
• for large n there is almost no difference at all
• if p̂ is close to 0.5, there is also almost no difference
Besides that, conservative confidence intervals (as the name says) are larger than confidence intervals found
by substitution. However, they are at the same time easier to compute.
Example 6.2.4 Complicated queueing system, continued
Suppose that now we are interested in the large-t probability p that a server is available.
Doing 100 simulations has shown that in 60 of them a server was available at time t = 1000 hrs.
What is a 95% confidence interval for this probability?
If 60 out of 100 simulations showed a free server, we can use p̂ = 60/100 = 0.6 as an estimate for p.
For a 95% confidence interval, z = Φ⁻¹(0.975) = 1.96.
The conservative confidence interval is:

p̂ ± z · 1/(2√n) = 0.6 ± 1.96 · 1/(2√100) = 0.6 ± 0.098.

For the confidence interval using substitution we get:

p̂ ± z · √(p̂(1 − p̂)/n) = 0.6 ± 1.96 · √(0.6 · 0.4/100) = 0.6 ± 0.096.
Example 6.2.5 Batting Average
In the 2002 season the baseball player Sammy Sosa had a batting average of 0.288. (The batting average is
the ratio of the number of hits to the number of times at bat.) Sammy Sosa was at bat 555 times in the 2002 season.
Could the “true” batting average still be 0.300?
Compute a 95% Confidence Interval for the true batting average.
Conservative Method gives:

0.288 ± 1.96 · 1/(2√555) = 0.288 ± 0.042

Substitution Method gives:

0.288 ± 1.96 · √(0.288 · (1 − 0.288)/555) = 0.288 ± 0.038
The substitution method gives a slightly smaller confidence interval, but both intervals contain 0.3. There is
not enough evidence to allow the conclusion that the true average is not 0.3.
Confidence intervals give us a way to measure the precision we get from simulations intended to evaluate
probabilities. But besides that, they also give us a way to plan how large a sample size has to be to achieve a
desired precision.
Example 6.2.6
Suppose, we want to estimate the fraction of records in the 2000 IRS data base that have a taxable income
over $35 K.
We want to get a 98% confidence interval and wish to estimate the quantity to within 0.01.
This means that our boundary e needs to be smaller than 0.01 (we’ll choose a conservative confidence interval
for ease of computation):

e ≤ 0.01
⇐⇒ z · 1/(2√n) ≤ 0.01    (z = Φ⁻¹((1 + 0.98)/2) = 2.33)
⇐⇒ 2.33 · 1/(2√n) ≤ 0.01
⇐⇒ √n ≥ 2.33/(2 · 0.01) = 116.5
⟹ n ≥ 13573
6.2.3 Related C.I. Methods
Related to the previous confidence intervals are confidence intervals for the difference between two means,
µ1 − µ2 , or the difference between two proportions, p1 − p2 .
Confidence intervals for these differences are given as:

Large-n confidence interval for µ1 − µ2 (based on independent X̄1 and X̄2):

X̄1 − X̄2 ± z · √(s1²/n1 + s2²/n2)

Large-n confidence interval for p1 − p2 (based on independent p̂1 and p̂2):

p̂1 − p̂2 ± z · (1/2) · √(1/n1 + 1/n2)   (conservative)

or p̂1 − p̂2 ± z · √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)   (substitution)
Why? The argumentation in both cases is very similar - we will only discuss the confidence interval for
the difference between means.
X̄1 − X̄2 is approximately normal, since X̄1 and X̄2 are approximately normal and independent, with

E[X̄1 − X̄2] = E[X̄1] − E[X̄2] = µ1 − µ2

Var[X̄1 − X̄2] = Var[X̄1] + (−1)² Var[X̄2] = σ1²/n1 + σ2²/n2

Then we can use the same arguments as before and get a C.I. for µ1 − µ2 as shown above.
Example 6.2.7
Assume we have two parts of the IRS database: East Coast and West Coast.
We want to compare the mean taxable income reported from the two regions in 2000.

East Coast: n1 = 1000 sampled records, mean taxable income x̄1 = $37200, standard deviation s1 = $10100
West Coast: n2 = 2000 sampled records, mean taxable income x̄2 = $42000, standard deviation s2 = $15600

We can, for example, compute a two-sided 95% confidence interval for µ1 − µ2, the difference in mean taxable
income as reported in the 2000 tax returns between East and West Coast:

37200 − 42000 ± 1.96 · √(10100²/1000 + 15600²/2000) = −4800 ± 927

Note: this shows pretty conclusively that the mean West Coast taxable income is higher than the mean East
Coast taxable income (in the report from 2000). The interval contains only negative numbers - if it contained
0, the message wouldn’t be so clear.
One-sided intervals
Idea: use only one of the end points x̄ ± z · s/√n.
This yields confidence intervals for µ of the form

(−∞, x̄ + z · s/√n)   (upper bound)   or   (x̄ − z · s/√n, ∞)   (lower bound)
However, now we need to adjust z to the new situation. Instead of worrying about two tails of the normal
distribution, we use for a one sided confidence interval only one tail.
Figure 6.2: One sided (upper bounded) confidence interval for µ (in red).
Example 6.2.8 Complicated queueing system, continued
What is a 95% upper confidence bound for µ, the mean length of the queue?
x̄ + z · s/√n is the upper confidence bound. Instead of z = Φ⁻¹((α + 1)/2) we use z = Φ⁻¹(α) (see fig. 6.2).
This gives 21.5 + 1.65 · 15/√50 = 25.0 as the upper confidence bound. Therefore the one-sided, upper bounded confidence interval is (−∞, 25.0).
Critical values z = Φ⁻¹(α) for the one-sided confidence interval are:

α:  0.90  0.95  0.98  0.99
z:  1.29  1.65  2.06  2.33
Example 6.2.9
Two different digital communication systems each send 100 large messages, and we determine how
many are corrupted in transmission:
p̂1 = 0.05 and p̂2 = 0.10.
What is the difference in the corruption rates? Find a 98% confidence interval.
Using the substitution method:

0.05 − 0.10 ± 2.33 · √(0.05 · 0.95/100 + 0.10 · 0.90/100) = −0.05 ± 0.086

This calculation tells us that, based on these sample sizes, we don’t even have a solid idea about the sign of
p1 − p2 , i.e. we can’t tell which of the pi is larger.
So far, we have only considered large sample confidence intervals. The problem with smaller sample sizes is
that the normal approximation from the CLT no longer works well, especially when the variance σ² is unknown.
What you need to know is that there exist different methods to compute C.I.s for smaller sample sizes.
6.3 Hypothesis Testing
Example 6.3.1 Tea Tasting Lady
It is claimed that a certain lady is able to tell, by tasting a cup of tea with milk, whether the milk was put
in first or the tea was put in first.
To put the claim to the test, the lady is given 10 cups of tea to taste and is asked to state in each case
whether the milk went in first or the tea went in first.
To guard against deliberate or accidental communication of information, before pouring each cup of tea a
coin is tossed to decide whether the milk goes in first or the tea goes in first. The person who brings the cup
of tea to the lady does not know the outcome of the coin toss.
Either the lady has some skill (she can tell to some extent the difference) or she has not, in which case she
is simply guessing.
Suppose, the lady tested 10 cups of tea in this manner and got 9 of them right.
This looks rather suspicious, the lady seems to have some skill. But how can we check it?
We start with the sceptical assumption that the lady does not have any skill. If the lady has no skill at all,
the probability she gives a correct answer for any single cup of tea is 1/2.
The number of cups she gets right has therefore a Binomial distribution with parameter n = 10 and p = 0.5.
The diagram shows the probability mass function of this distribution:
(Figure: probability mass function p(x) of the Bin(10, 0.5) distribution, with the observed value x = 9 marked.)

Events that are as unlikely or less likely are that the lady got all 10 cups right or - very different, but
nevertheless very rare - that she only got 1 cup or none right (note, this would be evidence of some “anti-skill”, but it would certainly be evidence against her guessing).
The total probability of these events is (remember, the binomial probability mass function is p(x) = (n choose x) · p^x · (1 − p)^{n−x}):

p(0) + p(1) + p(9) + p(10) = 0.5¹⁰ + 10 · 0.5¹⁰ + 10 · 0.5¹⁰ + 0.5¹⁰ = 0.021
i.e. what we have just observed is a fairly rare event under the assumption, that the lady is only guessing.
This suggests, that the lady may have some skill in detecting which was poured first into the cup.
Jargon: 0.021 is called the p-value for testing the hypothesis p = 0.5.
The fact that the p-value is small is evidence against the hypothesis.
Hypothesis testing is a formal procedure to check whether or not some - previously made - assumption can
be rejected based on the data.
We are going to abstract the main elements of the previous example and cook up a standard series of steps
for hypothesis testing:
Example 6.3.2
University CC administrators have historical records indicating that between August and October 2002 the
mean time between hits on the ISU homepage was 2 min.
They suspect that in fact the mean time between hits has decreased (i.e. traffic is up). Sampling 50
inter-arrival times from records for November 2002 gives x̄ = 1.7 min and s = 1.9 min.
Is this strong evidence for an increase in traffic?
The formal procedure, applied to the example:

1. State a “null hypothesis” of the form H0 : function of parameter(s) = #, meant to embody a status quo / pre-data view.
Here: H0 : µ = 2.0 min between hits.

2. State an “alternative hypothesis” of the form Ha : function of parameter(s) {>, ≠, <} #, meant to identify a departure from H0 .
Here: Ha : µ < 2 (traffic is up).

3. State test criteria - consisting of a test statistic, a “reference distribution” giving the behavior of the test statistic if H0 is true, and the kinds of values of the test statistic that count as evidence against H0 .
Here: the test statistic will be Z = (X̄ − 2.0)/(s/√n). The reference density will be standard normal; large negative values of Z count as evidence against H0 in favor of Ha .

4. Show computations.
Here: the sample gives z = (1.7 − 2.0)/(1.9/√50) = −1.12.

5. Report and interpret a p-value, the “observed level of significance with which H0 can be rejected”. This is the probability of an observed value of the test statistic at least as extreme as the one at hand. The smaller this value is, the less likely it is that H0 is true.
Here: the p-value is P(Z ≤ −1.12) = Φ(−1.12) = 0.1314. This value is not terribly small - the evidence of a decrease in mean time between hits is somewhat weak.

Note aside: a 90% confidence interval for µ is

x̄ ± 1.65 · s/√n = 1.7 ± 0.44

This interval contains the hypothesized value µ = 2.0.
There are four basic hypothesis tests of this form, testing a mean, a proportion, or differences between two
means or two proportions. Depending on the hypothesis, the test statistic will be different. Here is an
overview of the tests we are going to use (in each case the reference distribution of Z is standard normal):

H0 : µ = #:        Z = (X̄ − #)/(s/√n)
H0 : p = #:        Z = (p̂ − #)/√(#(1 − #)/n)
H0 : µ1 − µ2 = #:  Z = (X̄1 − X̄2 − #)/√(s1²/n1 + s2²/n2)
H0 : p1 − p2 = #:  Z = (p̂1 − p̂2 − #)/√(p̂(1 − p̂)(1/n1 + 1/n2)),   where p̂ = (n1 p̂1 + n2 p̂2)/(n1 + n2)
Example 6.3.3 Tax fraud
Historically, IRS taxpayer compliance audits have revealed that about 5% of individuals do things on their
tax returns that invite criminal prosecution.
A sample of n = 1000 tax returns produces p̂ = 0.061 as an estimate of the fraction of fraudulent returns. Does this provide a clear signal of change in taxpayer behavior?

1. State null hypothesis: H0 : p = 0.05
2. Alternative hypothesis: Ha : p ≠ 0.05
3. Test statistic:

Z = (p̂ − 0.05)/√(0.05 · 0.95/n)

Under the null hypothesis Z has a standard normal distribution; large values of Z - positive or
negative - will count as evidence against H0 .

4. Computation: z = (0.061 − 0.05)/√(0.05 · 0.95/1000) = 1.59

5. p-value: P(|Z| ≥ 1.59) = P(Z ≤ −1.59) + P(Z ≥ 1.59) = 0.11. This is not a very small value; we
therefore have only very weak evidence against H0 .
Example 6.3.4 life time of disk drives
n1 = 30 and n2 = 40 disk drives of 2 different designs were tested under conditions of “accelerated” stress
and times to failure recorded:
Standard Design
n1 = 30
x̄1 = 1205 hr
s1 = 1000 hr
New Design
n2 = 40
x̄2 = 1400 hr
s2 = 900 hr
Does this provide conclusive evidence that the new design has a larger mean time to failure under “accelerated” stress conditions?
1. state null hypothesis: H0 : µ1 = µ2 (µ1 − µ2 = 0)
2. alternative hypothesis: Ha : µ1 < µ2 (µ1 − µ2 < 0)
3. test statistic is:

   Z = (x̄1 − x̄2 − 0)/√(s1²/n1 + s2²/n2)

   Z has under the null hypothesis a standard normal distribution; we will consider large negative values
   of Z as evidence against H0.
4. computation: z = (1205 − 1400 − 0)/√(1000²/30 + 900²/40) = −0.84
5. p-value: P (Z < −0.84) = 0.2005
This is not a very small value; we therefore have only very weak evidence against H0.
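Steps 3-5 of this example can be sketched in code as well; again the function name is our own, and P(Z ≤ z) is computed via the complementary error function:

```python
from math import sqrt, erfc

def two_mean_z_test(xbar1, s1, n1, xbar2, s2, n2):
    """z-test for H0: mu1 - mu2 = 0 vs Ha: mu1 < mu2 (lower-tail p-value)."""
    z = (xbar1 - xbar2) / sqrt(s1**2 / n1 + s2**2 / n2)
    p_value = 0.5 * erfc(-z / sqrt(2))   # P(Z <= z) for standard normal Z
    return z, p_value

# Example 6.3.4: standard vs. new disk drive design
z, p = two_mean_z_test(1205, 1000, 30, 1400, 900, 40)   # z ≈ -0.84, p ≈ 0.20
```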
Example 6.3.5 queueing systems
Consider 2 very complicated queueing systems. We'd like to know whether there is a difference in the
large-t probabilities of there being an available server.
We run simulations for each system (each run with a different random seed) and check whether at time
t = 2000 a server is available:

System 1:  n1 = 1000 runs,  p̂1 = 551/1000
System 2:  n2 = 500 runs,   p̂2 = 303/500

How strong is the evidence of a difference between the t = 2000 availability of a server for the two systems?
1. state null hypothesis: H0 : p1 = p2 (p1 − p2 = 0)
2. alternative hypothesis: Ha : p1 ≠ p2 (p1 − p2 ≠ 0)
3. Preliminary: note that, if there were no difference between the two systems, a plausible estimate of the
availability of a server would be

   p̂ = (n1 p̂1 + n2 p̂2)/(n1 + n2) = (551 + 303)/(1000 + 500) = 0.569

A test statistic is:

   Z = (p̂1 − p̂2 − 0)/√(p̂(1 − p̂) · (1/n1 + 1/n2))

Z has under the null hypothesis a standard normal distribution; we will consider large values of |Z| as
evidence against H0.
4. computation: z = (0.551 − 0.606)/√(0.569 · (1 − 0.569) · (1/1000 + 1/500)) = −2.03
5. p-value: P (|Z| > 2.03) = 0.04. This is fairly strong evidence of a real difference in the t = 2000
availabilities of a server between the two systems.
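The pooled two-proportion test above can be sketched as a short script (function name ours):

```python
from math import sqrt, erfc

def two_prop_z_test(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 = p2, using the pooled estimate p-hat."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)       # plausible common value under H0
    z = (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    p_value = erfc(abs(z) / sqrt(2))     # P(|Z| >= |z|)
    return z, p_value

# Example 6.3.5: 551/1000 vs. 303/500 servers available at t = 2000
z, p = two_prop_z_test(551, 1000, 303, 500)   # z ≈ -2.03, p-value ≈ 0.04
```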
6.4 Regression
A statistical investigation only rarely focuses on the distribution of a single variable. We are often interested
in comparisons among several variables, in changes in a variable over time, or in relationships among several
variables.
The idea of regression is that we have a vector (X1 , . . . , Xk ) and try to approximate the behavior of Y by
finding a function g(X1 , . . . , Xk ) such that Y ≈ g(X1 , . . . , Xk ).
The simplest possible version is:
6.4.1 Simple Linear Regression (SLR)
Situation: k = 1 and Y is approximately linearly related to X, i.e. g(x) = b0 + b1 x.
Notes:
• Scatterplot of Y vs X should show the linear relationship.
• linear relationship may be true only after a transformation of X and/or Y , i.e. one needs to find the
“right” scale for the variables:
e.g. if y ≈ c · x^b , this is nonlinear in x, but it implies that

ln y ≈ b · ln x + ln c,

so with y′ := ln y and x′ := ln x, i.e. on a log scale for both the x- and y-axis, one gets a linear relationship.
Example 6.4.1 Mileage vs Weight
Measurements on 38 1978-79 model automobiles. Gas mileage in miles per gallon as measured by Consumers’
Union on a test track. Weight as reported by automobile manufacturer.
A scatterplot of mpg versus weight shows an inversely proportional relationship:

[Figure: scatterplot of MPG (20-35) versus Weight (2.25-3.75).]

Transform weight by x ↦ 1/x to weight⁻¹. A scatterplot of mpg versus weight⁻¹ reveals a linear relationship:

[Figure: scatterplot of MPG (20-35) versus 1/Wgt (0.300-0.450).]
Example 6.4.2 Olympics - long jump
Results for the long jump for all Olympic Games between 1900 and 1996 are:
year   long jump (in m)      year   long jump (in m)
1900   7.19                  1960   8.12
1904   7.34                  1964   8.07
1908   7.48                  1968   8.90
1912   7.60                  1972   8.24
1920   7.15                  1976   8.34
1924   7.45                  1980   8.54
1928   7.74                  1984   8.54
1932   7.64                  1988   8.72
1936   8.06                  1992   8.67
1948   7.82                  1996   8.50
1952   7.57
1956   7.83

A scatterplot of long jump versus year shows:

[Figure: scatterplot of long jump (7.5-8.5 m) versus year (0-96, counted from 1900).]
The plot shows that it is perhaps reasonable to say that
y ≈ β0 + β1 x
The first issue to be dealt with in this context is: if we accept that y ≈ β0 + β1 x, how do we derive empirical
values of β0 , β1 from n data points (x, y)? The standard answer is the “least squares” principle:
[Figure: data points with a candidate line y = b0 + b1 x drawn through the scatterplot.]
In comparing lines that might be drawn through the plot we look at:
Q(b0 , b1 ) = Σ_{i=1}^n (yi − (b0 + b1 xi ))²
i.e. we look at the sum of squared vertical distances from points to the line and attempt to minimize this
sum of squares:
(d/db0) Q(b0 , b1 ) = −2 Σ_{i=1}^n (yi − (b0 + b1 xi ))
(d/db1) Q(b0 , b1 ) = −2 Σ_{i=1}^n xi (yi − (b0 + b1 xi ))
Setting the derivatives to zero gives:
n b0 + b1 Σ_{i=1}^n xi = Σ_{i=1}^n yi
b0 Σ_{i=1}^n xi + b1 Σ_{i=1}^n xi² = Σ_{i=1}^n xi yi
Least squares solutions for b0 and b1 are:
b1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²
   = ( Σ_{i=1}^n xi yi − (1/n) Σ_{i=1}^n xi · Σ_{i=1}^n yi ) / ( Σ_{i=1}^n xi² − (1/n)(Σ_{i=1}^n xi )² )      (slope)

b0 = ȳ − b1 x̄ = (1/n) Σ_{i=1}^n yi − b1 · (1/n) Σ_{i=1}^n xi      (y-intercept at x = 0)
These solutions produce the “best fitting line”.
Example 6.4.3 Olympics - long jump, continued
X := year (counted from 1900), Y := long jump (in m), n = 22
Σ xi = 1100,  Σ xi² = 74608,  Σ yi = 175.518,  Σ yi² = 1406.109,  Σ xi yi = 9079.584

The parameters for the best fitting line are:

b1 = (9079.584 − 1100 · 175.518/22) / (74608 − 1100²/22) = 0.0155 (in m per year)
b0 = 175.518/22 − (1100/22) · 0.0155 = 7.2037

The regression equation is

long jump = 7.204 + 0.016 · year (in m).
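The computation above can be reproduced from the summary sums alone; a minimal sketch (the function name is ours):

```python
def least_squares_from_sums(n, sx, sy, sxx, sxy):
    """Least squares slope and intercept from summary sums of x, y, x^2, x*y."""
    b1 = (sxy - sx * sy / n) / (sxx - sx * sx / n)
    b0 = sy / n - b1 * sx / n   # b0 = ybar - b1 * xbar
    return b0, b1

# Example 6.4.3: x = years since 1900, y = long jump in m
b0, b1 = least_squares_from_sums(22, 1100, 175.518, 74608, 9079.584)
# b0 ≈ 7.204, b1 ≈ 0.0155
```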
In addition, it is useful to be able to judge how well the line describes the data - i.e. how “linear looking”
a plot really is.
There are a couple of ways of doing this:
6.4.1.1 The sample correlation r

This is what we would get for the theoretical correlation ρ if we had random variables X and Y and knew
their joint distribution.
r := Σ_{i=1}^n (xi − x̄)(yi − ȳ) / √( Σ_{i=1}^n (xi − x̄)² · Σ_{i=1}^n (yi − ȳ)² )
   = ( Σ xi yi − (1/n) Σ xi · Σ yi ) / √( ( Σ xi² − (1/n)(Σ xi )² ) · ( Σ yi² − (1/n)(Σ yi )² ) )
The numerator is the numerator of b1 , one part under the root of the denominator is the denominator of b1 .
Because of its connection to ρ, the sample correlation r fulfills (it's not obvious to see, and we won't prove
it):
• −1 ≤ r ≤ 1
• r = ±1 exactly, when all (x, y) data pairs fall on a single straight line.
• r has the same sign as b1 .
Example 6.4.4 Olympics - long jump, continued
r = (9079.584 − 1100 · 175.518/22) / √( (74608 − 1100²/22) · (1406.109 − 175.518²/22) ) = 0.8997
A second measure for goodness of fit:

6.4.1.2 Coefficient of determination R²
This is based on a comparison of “variation accounted for” by the line versus “raw variation” of y.
The idea is that

Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n yi² − (1/n) ( Σ_{i=1}^n yi )² =: SST   (Total Sum of Squares)

is a measure for the variability of y. (It's (n − 1) · s²_y .)
[Figure: data points with a horizontal line at ȳ; the vertical distances are the prediction errors yi − ȳ.]
After fitting the line ŷ = b0 + b1 x, one doesn't predict y as ȳ anymore and suffer the errors of prediction
above, but rather only the errors

ei := yi − ŷi .

So, after fitting the line,

Σ_{i=1}^n ei² = Σ_{i=1}^n (yi − ŷi )² =: SSE   (Sum of Squares of Errors)

is a measure for the remaining (residual) error variation.
[Figure: data points with the fitted line y = b0 + b1 x; the vertical distances are the residuals ei.]
The fact is that SST ≥ SSE.
So: SSR := SST − SSE ≥ 0.
SSR is taken as a measure of “variation accounted for” in the fitting of the line.
The coefficient of determination R² is defined as:

R² = SSR/SST
Obviously 0 ≤ R² ≤ 1; the closer R² is to 1, the better the linear fit.
Example 6.4.5 Olympics - long jump, continued
SST = Σ_{i=1}^n yi² − (1/n) ( Σ_{i=1}^n yi )² = 1406.109 − 175.518²/22 = 5.81.
SSE and SSR?
y       x    ŷ       y − ŷ    (y − ŷ)²
7.185    0   7.204   −0.019   0.000
7.341    4   7.266    0.075   0.006
7.480    8   7.328    0.152   0.023
7.601   12   7.390    0.211   0.045
7.150   20   7.513   −0.363   0.132
7.445   24   7.575   −0.130   0.017
7.741   28   7.637    0.104   0.011
7.639   32   7.699   −0.060   0.004
8.060   36   7.761    0.299   0.089
7.823   48   7.947   −0.124   0.015
7.569   52   8.009   −0.440   0.194
7.830   56   8.071   −0.241   0.058
8.122   60   8.133   −0.011   0.000
8.071   64   8.195   −0.124   0.015
8.903   68   8.257    0.646   0.417
8.242   72   8.319   −0.077   0.006
8.344   76   8.381   −0.037   0.001
8.541   80   8.443    0.098   0.010
8.541   84   8.505    0.036   0.001
8.720   88   8.567    0.153   0.024
8.670   92   8.629    0.041   0.002
8.500   96   8.691   −0.191   0.036
                          SSE = 1.107

So SSR = SST − SSE = 5.810 − 1.107 = 4.703 and R² = SSR/SST = 0.8095.
Connection between R² and r
R² = SSR/SST - that's the squared sample correlation between y and ŷ.
If - and only if! - we use a linear function in x to predict y, i.e. ŷ = b0 + b1 x, the correlation between ŷ and
x is ±1. Then R² (and only then!) is equal to the squared sample correlation r² between y and x:

R² = r² if and only if ŷ = b0 + b1 x
Example 6.4.6 Olympics - long jump, continued
R² = 0.8095 = (0.8997)² = r².
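The identity R² = r² can be verified numerically from the long jump data of Example 6.4.2; a sketch, assuming ŷ comes from the least squares line:

```python
# Long jump data (Example 6.4.2): x = years since 1900, y = result in m.
xs = [0, 4, 8, 12, 20, 24, 28, 32, 36, 48, 52,
      56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96]
ys = [7.185, 7.341, 7.480, 7.601, 7.150, 7.445, 7.741, 7.639, 8.060,
      7.823, 7.569, 7.830, 8.122, 8.071, 8.903, 8.242, 8.344, 8.541,
      8.541, 8.720, 8.670, 8.500]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)                      # = SST ≈ 5.81
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b1 = sxy / sxx                                              # least squares slope
b0 = ybar - b1 * xbar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) # SSE ≈ 1.107

r = sxy / (sxx * syy) ** 0.5    # sample correlation ≈ 0.8997
r2 = (syy - sse) / syy          # R^2 = SSR/SST ≈ 0.8095, and r2 == r**2 here
```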
It is possible to go beyond simply fitting a line and summarizing the goodness of fit in terms of r and R2 to
doing inference, i.e. making confidence intervals, predictions, . . . based on the line fitting. But for that, we
need a probability model.
6.4.2 Simple Linear Regression Model

In words: for input x the output y is normally distributed with mean β0 + β1 x = µ_{y|x} and standard deviation
σ.
In symbols: yi = β0 + β1 xi + εi with εi i.i.d. normal N (0, σ²).
β0 , β1 , and σ² are the parameters of the model and have to be estimated from the data (the data pairs
(xi , yi )).
Pictorially:

[Figure: fitted line with the normal density of y given x centered on the line at each x.]
How do we get estimates for β0 , β1 , and σ 2 ?
Point estimates:
β̂0 = b0 , β̂1 = b1 from the Least Squares fit (which gives β̂0 and β̂1 the name Least Squares Estimates).
And σ²?
σ² measures the variation around the “true” line β0 + β1 x - we don't know that line, but only b0 + b1 x.
Should we base the estimation of σ² on this line?
The “right” estimator for σ² turns out to be:

σ̂² = (1/(n − 2)) Σ_{i=1}^n (yi − ŷi )² = SSE/(n − 2).
Example 6.4.7 Olympics - long jump, continued
β̂0 = b0 = 7.2037 (in m)
β̂1 = b1 = 0.0155 (in m per year)
σ̂² = SSE/(n − 2) = 1.107/20 = 0.055

Overall, we assume a linear regression model of the form:

y = 7.2037 + 0.0155 x + e, with e ∼ N (0, 0.055).
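Under the fitted model one can, for instance, read off the model mean at a given year or simulate new observations; a small sketch (the year 2000 value is an extrapolation, used purely for illustration):

```python
import random

# Fitted SLR parameters from Example 6.4.7 (x = years since 1900).
b0, b1, sigma2 = 7.2037, 0.0155, 0.055

def simulate_jump(x, rng=random):
    """Draw one y from the fitted model y = b0 + b1*x + e, e ~ N(0, sigma2)."""
    return b0 + b1 * x + rng.gauss(0.0, sigma2 ** 0.5)

mu_2000 = b0 + b1 * 100   # model mean mu_{y|x} at x = 100 (year 2000)
```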