4. Data: $X_1, X_2, \ldots, X_n \overset{i.i.d.}{\sim} N(\mu, \sigma^2)$ ($n < 30$)
100 years ago, people used the Z-test.
This is OK for n large ($n \ge 30$, scenario ⑤) by the CLT (** However, the exact t-test is better because it is exact **).
This is NOT OK if the sample size is small.
“A Student of Statistics” was the pen name of William Sealy Gosset (June 13, 1876–October 16, 1937), http://en.wikipedia.org/wiki/William_Sealy_Gosset
“The Student’s t-test”
P.Q. (pivotal quantity): $T = \dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$ (exact t-distribution with $n-1$ degrees of freedom)
In this scenario (normal population, variance unknown, sample size is small), the Z-test is the wrong test. For example, for a 2-sided alternative hypothesis, the wrong decision rule at the significance level of $\alpha$ is:
Reject $H_0$ if $|Z_0| \ge z_{\alpha/2}$
Right test for a 2-sided alternative hypothesis:
Reject $H_0$ if $|T_0| \ge t_{n-1,\alpha/2}$
(Note: because the t-distribution has heavier tails than the standard normal distribution, $t_{n-1,\alpha/2} > z_{\alpha/2}$, so the Z-test rule wrongly rejects $H_0$ when $z_{\alpha/2} \le |T_0| < t_{n-1,\alpha/2}$.)
*Even if the sample size is large, if the population is normal, the t-test is the exact test and thus is better than the large-sample approximate Z-test based on the CLT.
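$H_0: \mu = \mu_0$ versus $H_a: \mu \ne \mu_0$
* Test statistic: $T_0 = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}} \overset{H_0}{\sim} t_{n-1}$
* Rejection region: Reject $H_0$ at level $\alpha$ if $|T_0| \ge t_{n-1,\alpha/2}$
* p-value = shaded area × 2 (the shaded area being the tail area beyond $|t_0|$ under the $t_{n-1}$ pdf)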
1. Definition: t-distribution
$T = \dfrac{Z}{\sqrt{W/k}} \sim t_k$, where
$Z \sim N(0, 1)$,
$W \sim \chi^2_k$ (chi-square distribution with $k$ degrees of freedom), and
$Z$ and $W$ are independent.
2. Def 1: chi-square distribution: from the definition of the gamma distribution
(The chi-square distribution is a special gamma distribution. Which one? Check it out please, class.)
3. Def 2: chi-square distribution: Let $Z_1, Z_2, \ldots, Z_k \overset{i.i.d.}{\sim} N(0, 1)$, then
$W = \sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$
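As a quick numerical sanity check of Def 2, the sketch below simulates $W = \sum_{i=1}^{k} Z_i^2$ and compares its sample mean and variance with $k$ and $2k$, the known mean and variance of a $\chi^2_k$ random variable. The choice $k = 5$, the seed, and the number of replications are arbitrary assumptions made only for illustration.

```python
import random
from statistics import fmean, variance

random.seed(2013)   # arbitrary seed, only for reproducibility
k = 5               # degrees of freedom (illustrative choice)
reps = 20_000       # number of simulated W values (illustrative choice)

# Each W is the sum of k squared independent N(0, 1) draws.
w = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k)) for _ in range(reps)]

mean_w = fmean(w)      # should be close to k
var_w = variance(w)    # should be close to 2k
print(f"sample mean     = {mean_w:.3f} (theory: {k})")
print(f"sample variance = {var_w:.3f} (theory: {2 * k})")
```

The simulated moments should agree with the theoretical ones only up to Monte Carlo error, so small deviations are expected.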
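Theorem
Sampling from the normal population
Let $X_1, X_2, \ldots, X_n \overset{i.i.d.}{\sim} N(\mu, \sigma^2)$, then
1) $\bar{X} \sim N\left(\mu, \dfrac{\sigma^2}{n}\right)$
2) $W = \dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$
3) $\bar{X}$ and $S^2$ (and thus $W$) are independent. Thus we have:
$T = \dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$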
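Proof) $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$
Let $W = \dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$. Since $Z$ and $W$ are independent, the definition of the t-distribution gives
$T = \dfrac{Z}{\sqrt{W/(n-1)}} = \dfrac{\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{(n-1)S^2}{\sigma^2} \Big/ (n-1)}} = \dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$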
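The practical consequence of the theorem can be illustrated by simulation. The sketch below draws many small normal samples, computes $T$ for each, and counts how often $|T|$ exceeds the t critical value versus the z critical value. The values $n = 9$, $\mu = 10$, $\sigma = 3$ are hypothetical, and the critical values 2.306 ($t_{8,0.025}$) and 1.960 ($z_{0.025}$) are standard table values, not taken from these notes.

```python
import math
import random

random.seed(2013)              # arbitrary seed, only for reproducibility
n, mu, sigma = 9, 10.0, 3.0    # hypothetical population and sample size
reps = 20_000                  # number of simulated samples
t_crit = 2.306                 # t_{8, 0.025}, standard table value (assumption)
z_crit = 1.960                 # z_{0.025}, standard table value (assumption)

def t_statistic(sample, mu0):
    """Return (xbar - mu0) / (s / sqrt(m)) for one sample of size m."""
    m = len(sample)
    xbar = sum(sample) / m
    s2 = sum((x - xbar) ** 2 for x in sample) / (m - 1)
    return (xbar - mu0) / math.sqrt(s2 / m)

t_vals = [
    t_statistic([random.gauss(mu, sigma) for _ in range(n)], mu)
    for _ in range(reps)
]

# Rejection rates under a true H0 for the two decision rules.
frac_t = sum(abs(t) >= t_crit for t in t_vals) / reps
frac_z = sum(abs(t) >= z_crit for t in t_vals) / reps
print(f"rejection rate with t cutoff: {frac_t:.4f}")
print(f"rejection rate with z cutoff: {frac_z:.4f}")
```

With the t cutoff the rejection rate should sit near the nominal 0.05, while the z cutoff should reject noticeably more often, which is the over-rejection of the Z-test rule for small samples.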
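4. The following is a summary of the decision rules for the t-test, using either the rejection region approach or the p-value approach:
Hypotheses:
(1) $H_0: \mu = \mu_0$ versus $H_a: \mu > \mu_0$
(2) $H_0: \mu = \mu_0$ versus $H_a: \mu < \mu_0$
(3) $H_0: \mu = \mu_0$ versus $H_a: \mu \ne \mu_0$
Test statistic (with observed value $t_0$): $T_0 = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}} \overset{H_0}{\sim} t_{n-1}$
Rejection region: we reject $H_0$ in favor of $H_a$ at the significance level $\alpha$ if
(1) $T_0 \ge t_{n-1,\alpha}$
(2) $T_0 \le -t_{n-1,\alpha}$
(3) $|T_0| \ge t_{n-1,\alpha/2}$
p-value:
(1) $P(T_0 \ge t_0 \mid H_0)$ = the area under the $t_{n-1}$ pdf to the right of $t_0$
(2) $P(T_0 \le t_0 \mid H_0)$ = the area under the $t_{n-1}$ pdf to the left of $t_0$
(3) $P(|T_0| \ge |t_0| \mid H_0) = 2\,P(T_0 \ge |t_0| \mid H_0)$ = twice the area under the $t_{n-1}$ pdf to the right of $|t_0|$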
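The three p-value rules above can be sketched in code. The block below is a minimal illustration using only the Python standard library: it integrates the $t$ pdf numerically with Simpson's rule rather than calling a statistics package. The function names `t_pdf`, `t_cdf`, and `t_test_p_value` are hypothetical helpers introduced here, not part of any library.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x): 0.5 plus the Simpson's-rule integral of the pdf over [0, x]."""
    h = x / steps                      # negative when x < 0, which flips the sign
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_test_p_value(t0, df, alternative):
    """p-value for H_a: mu > mu0 ('greater'), mu < mu0 ('less'), or mu != mu0 ('two-sided')."""
    if alternative == "greater":
        return 1 - t_cdf(t0, df)               # area to the right of t0
    if alternative == "less":
        return t_cdf(t0, df)                   # area to the left of t0
    if alternative == "two-sided":
        return 2 * (1 - t_cdf(abs(t0), df))    # twice the area right of |t0|
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
```

For instance, `t_test_p_value(1.22, 8, "greater")` should come out near 0.1286, in line with the R call quoted in Solution B of the example that follows.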
Jerry is planning to purchase a sporting goods store. He calculated that in order to cover basic expenses, the average daily sales must be at least $525.
Scenario A. He checked the daily sales of 36 randomly selected business days, and found the average daily sales to be $565 with a standard deviation of $150.
Scenario B. Now suppose he is only allowed to sample 9 days, and the sales for those 9 days are
$510, 537, 548, 592, 503, 490, 601, 499, 640.
For A and B, please determine whether Jerry can conclude the average daily sales to be at least $525 at the significance level of $\alpha = 0.05$. What is the p-value for each scenario?
Solution A: large sample (scenario ⑤), $n = 36$, $\bar{x} = 565$, $s = 150$
$H_0: \mu = 525$ versus $H_a: \mu > 525$
*** First perform the Shapiro-Wilk test to check for normality. If normal, use the exact t-test. If not normal, use the large-sample Z-test. In the following, we assume the population is found not normal, and we perform the large-sample approximate Z-test based on the CLT.
Test statistic: $z_0 = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = \dfrac{565 - 525}{150/\sqrt{36}} = 1.6$
At the significance level $\alpha = 0.05$, we will reject $H_0$ if $z_0 \ge z_{0.05} = 1.645$.
Since $z_0 = 1.6 < 1.645$, we cannot reject $H_0$.
p-value $= P(Z \ge 1.6) = 0.0548$
Alternatively, if you can show the population is normal using the Shapiro-Wilk test, it is better that you perform the exact t-test.
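The arithmetic in Solution A can be reproduced with a short sketch. It uses `statistics.NormalDist` from the Python standard library for the normal cdf and quantile; the variable names are illustrative choices, and the numbers are the ones given in Scenario A.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics from Scenario A.
n, xbar, s = 36, 565.0, 150.0
mu0, alpha = 525.0, 0.05

std_normal = NormalDist()                   # N(0, 1)
z0 = (xbar - mu0) / (s / sqrt(n))           # observed test statistic
z_crit = std_normal.inv_cdf(1 - alpha)      # upper-alpha critical value
p_value = 1 - std_normal.cdf(z0)            # right-tail area beyond z0
reject = z0 >= z_crit

print(f"z0 = {z0:.3f}, critical value = {z_crit:.3f}")
print(f"p-value = {p_value:.4f}, reject H0: {reject}")
```

Because the p-value is slightly above 0.05, the code reaches the same conclusion as the hand calculation: $H_0$ is not rejected.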
Solution B: small sample
Shapiro-Wilk test: if the population is normal, the t-test is suitable.
(*If the population is not normal and the sample size is small, we shall use a nonparametric test such as the Wilcoxon signed-rank test.)
In the following, we assume the population is found normal.
$\bar{x} = 546.67$, $s = 53.09$, $n = 9$
$H_0: \mu = 525$ versus $H_a: \mu > 525$
Test statistic: $t_0 = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = \dfrac{546.67 - 525}{53.09/\sqrt{9}} = 1.22$
At the significance level $\alpha = 0.05$, we will reject $H_0$ if $t_0 \ge t_{8, 0.05} = 1.86$.
Since $t_0 = 1.22 < 1.86$, we cannot reject $H_0$.
What’s the p-value when $t_0 = 1.22$?
Use R: 1 - pt(1.22, 8) → the answer is: p = 0.1286031
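For readers without R at hand, Solution B can be sketched from the raw data using only the Python standard library. The helper names `t_pdf` and `t_upper_tail` are hypothetical and defined here; the upper-tail probability is obtained by numerically integrating the $t_8$ pdf with Simpson's rule, and the critical value 1.86 is the one quoted above. Because the code uses the unrounded $t_0$, its p-value should differ slightly from the R result based on the rounded value 1.22.

```python
import math
from statistics import mean, stdev

sales = [510, 537, 548, 592, 503, 490, 601, 499, 640]   # Scenario B data
mu0, t_crit = 525.0, 1.86                               # t_{8, 0.05} as quoted in the notes

n = len(sales)
xbar = mean(sales)
s = stdev(sales)                                        # sample standard deviation (n - 1 divisor)
t0 = (xbar - mu0) / (s / math.sqrt(n))

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(x, df, steps=2000):
    """P(T >= x): 0.5 minus the Simpson's-rule integral of the pdf over [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

p_value = t_upper_tail(t0, n - 1)
reject = t0 >= t_crit
print(f"xbar = {xbar:.2f}, s = {s:.2f}, t0 = {t0:.3f}")
print(f"p-value = {p_value:.4f}, reject H0: {reject}")
```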
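5. *On the construction of confidence intervals:
1. $X_1, X_2, \ldots, X_n$ are an i.i.d. random sample of size $n$ ($n > 30$) from a population with an unknown, non-normal distribution. Please derive the $100(1-\alpha)\%$ CI for $\mu$.
Ans: According to the CLT, $Z = \dfrac{\bar{X} - \mu}{S/\sqrt{n}} \overset{\cdot}{\sim} N(0, 1)$, where $S = \sqrt{\dfrac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$.
Since $P\left(-z_{\alpha/2} \le \dfrac{\bar{X} - \mu}{S/\sqrt{n}} \le z_{\alpha/2}\right) \approx 1 - \alpha$, the (approximate) $100(1-\alpha)\%$ CI for $\mu$ is
$\left(\bar{X} - z_{\alpha/2}\dfrac{S}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\dfrac{S}{\sqrt{n}}\right)$.
*If the population variance $\sigma^2$ is known, then the CI will be
$\left(\bar{X} - z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\right)$.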
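As an illustration of the large-sample interval, the sketch below plugs the Scenario A summary statistics ($n = 36$, $\bar{x} = 565$, $s = 150$) into the formula. Reusing those numbers for a two-sided 95% interval is an assumption made here for illustration; the notes themselves only use them for a one-sided test.

```python
from math import sqrt
from statistics import NormalDist

n, xbar, s = 36, 565.0, 150.0    # Scenario A summary statistics
alpha = 0.05                     # illustrative confidence level of 95%

z_half = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
margin = z_half * s / sqrt(n)                  # half-width of the interval
lower, upper = xbar - margin, xbar + margin

print(f"z_(alpha/2) = {z_half:.3f}")
print(f"approximate 95% CI for mu: ({lower:.1f}, {upper:.1f})")
```

The interval is only approximate, since it relies on the CLT and on $S$ standing in for $\sigma$.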
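2. $X_1, X_2, \ldots, X_n$ are an i.i.d. random sample of size $n$ from $N(\mu, \sigma^2)$, where $\sigma^2$ is unknown. Derive the $100(1-\alpha)\%$ CI for $\mu$.
Ans: Since $\dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$, then
$P\left(-t_{n-1,\alpha/2} \le \dfrac{\bar{X} - \mu}{S/\sqrt{n}} \le t_{n-1,\alpha/2}\right) = 1 - \alpha$,
so the $100(1-\alpha)\%$ CI for $\mu$ is
$\left(\bar{X} - t_{n-1,\alpha/2}\dfrac{S}{\sqrt{n}},\ \bar{X} + t_{n-1,\alpha/2}\dfrac{S}{\sqrt{n}}\right)$.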
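A small sketch of the exact t interval, applied to the nine daily sales figures from Scenario B. Using that data for a 95% interval is an illustrative assumption, and the critical value $t_{8,0.025} = 2.306$ is hard-coded from a standard t-table rather than computed, since the Python standard library has no t quantile function.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

sales = [510, 537, 548, 592, 503, 490, 601, 499, 640]   # Scenario B data
t_half = 2.306                                          # t_{8, 0.025}, standard table value (assumption)
z_half = NormalDist().inv_cdf(0.975)                    # z_{0.025}, for comparison only

n = len(sales)
xbar = mean(sales)
se = stdev(sales) / sqrt(n)                             # estimated standard error S / sqrt(n)

lower, upper = xbar - t_half * se, xbar + t_half * se
print(f"95% t interval for mu: ({lower:.2f}, {upper:.2f})")
print(f"half-width with t: {t_half * se:.2f}, with z: {z_half * se:.2f}")
```

The t-based half-width is larger than the z-based one for the same standard error, reflecting the extra uncertainty from estimating $\sigma$ with only 9 observations.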
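3. When the population is normal and $\sigma^2$ is known, which one of the two CIs above is better?
Ans: Because the t-distribution has longer tails than the normal distribution, $t_{n-1,\alpha/2} > z_{\alpha/2}$. So when $S$ is very close to $\sigma$,
$\left(\bar{X} - z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\right)$
will be shorter than
$\left(\bar{X} - t_{n-1,\alpha/2}\dfrac{S}{\sqrt{n}},\ \bar{X} + t_{n-1,\alpha/2}\dfrac{S}{\sqrt{n}}\right)$,
so the CI based on the Z statistic is better.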
*** A Quick Review of Order Statistics
Random sample: $X_1, X_2, \ldots, X_n \overset{i.i.d.}{\sim}$ population with pdf $f(x)$
Order statistics: $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$
*** In the following, we shall derive the pdf of the first order statistic using the cumulative distribution function (c.d.f.) method:
$1 - F_{X_{(1)}}(x) = P(X_{(1)} > x) = P(\min(X_1, X_2, \ldots, X_n) > x)$
$= P(X_1 > x, X_2 > x, \ldots, X_n > x)$
$= P(X_1 > x)\,P(X_2 > x) \cdots P(X_n > x)$
$= [1 - F(x)]^n$
Differentiating both sides with respect to $x$ gives
$f_{X_{(1)}}(x) = n[1 - F(x)]^{n-1} f(x)$
*** You can derive the pdf of the last (largest) order statistic similarly.
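A sketch of the analogous derivation for the largest order statistic, using the same notation ($F$ and $f$ are the population cdf and pdf): the maximum is at most $x$ exactly when every observation is at most $x$, and independence turns the joint probability into a product.

```latex
\begin{aligned}
F_{X_{(n)}}(x) &= P(X_{(n)} \le x) = P(\max(X_1, X_2, \ldots, X_n) \le x) \\
               &= P(X_1 \le x,\ X_2 \le x,\ \ldots,\ X_n \le x) \\
               &= P(X_1 \le x)\, P(X_2 \le x) \cdots P(X_n \le x) = [F(x)]^n, \\
f_{X_{(n)}}(x) &= \frac{d}{dx}\,[F(x)]^n = n\,[F(x)]^{n-1} f(x).
\end{aligned}
```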
① Power of the test
② Sample size calculation
③ Do it in SAS & R
④ Inference on $\sigma^2$
Quiz 3 will be given on Thursday, 09/19/2013