Chapter 6 Introduction to Statistical Inference

Introduction
• Goal: Make statements regarding a population (or state of
nature) based on a sample of measurements
• Probability statements used to substantiate claims
• Example: Clinical Trial for Pravachol (5-year follow-up)
– Of 3302 subjects receiving Pravachol, 174 suffered MI or death due to CHD
– Of 3293 subjects receiving placebo, 248 suffered MI or death due to CHD
$\hat{p}_{\text{Pravachol}} = \frac{174}{3302} = .0527 \;(5.27\%) \qquad \hat{p}_{\text{placebo}} = \frac{248}{3293} = .0753 \;(7.53\%)$

Probability that Pravachol would do this much better if not effective: .000088 (approximately one chance in 11,363)
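The formula behind this probability is developed later in the chapter; as a preview, here is a minimal sketch of the large-sample two-proportion z test that produces it (the function name is illustrative, plain Python standard library):

```python
from math import sqrt
from statistics import NormalDist

def two_prop_pvalue(x1, n1, x2, n2):
    """One-sided P-value for H0: p1 = p2 vs Ha: p1 < p2 (group 1 does better)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2              # sample proportions
    pooled = (x1 + x2) / (n1 + n2)                 # common proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2_hat - p1_hat) / se                     # standardized difference
    return 1 - NormalDist().cdf(z)                 # upper-tail area

# Pravachol (174/3302) vs placebo (248/3293): prints about 0.000088
print(two_prop_pvalue(174, 3302, 248, 3293))
```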
Estimating with Confidence
• Goal: Estimate a population mean (proportion) based on
sample mean (proportion)
• Unknown: Parameter (μ, p)
• Known: Approximate Sampling Distribution of Statistic:

$\bar{X} \sim N\!\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right) \qquad \hat{p} \sim N\!\left(p,\ \sqrt{\frac{p(1-p)}{n}}\right)$
• Recall: For a random variable that is normally distributed, the probability that it will fall within 2 standard deviations of the mean is approximately 0.95:

$P\!\left(\mu - 2\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + 2\frac{\sigma}{\sqrt{n}}\right) \approx 0.95$

$P\!\left(p - 2\sqrt{\frac{p(1-p)}{n}} \le \hat{p} \le p + 2\sqrt{\frac{p(1-p)}{n}}\right) \approx 0.95$
Estimating with Confidence
• Although the parameter is unknown, it’s highly likely
that our sample mean or proportion (estimate) will lie
within 2 standard deviations (aka standard errors) of
the population mean or proportion (parameter)
• Margin of Error: Measure of the upper bound in
sampling error with a fixed level (we will use 95%) of
confidence. That will correspond to 2 standard errors:
Mean: Margin of Error (95% Confidence): $2\frac{\sigma}{\sqrt{n}}$

Proportion: Margin of Error (95% Confidence): $2\sqrt{\frac{p(1-p)}{n}}$

Confidence Interval: estimate ± margin of error
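As a quick illustration, a minimal sketch of both intervals using the 2-standard-error convention above (for the proportion interval, p̂ is substituted for the unknown p in the standard error, as is standard in large samples):

```python
from math import sqrt

def ci_mean(xbar, sigma, n, z=2):
    """Approximate 95% CI for a mean: estimate +/- z standard errors."""
    me = z * sigma / sqrt(n)
    return xbar - me, xbar + me

def ci_prop(p_hat, n, z=2):
    """Approximate 95% CI for a proportion (p_hat replaces p in the SE)."""
    me = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - me, p_hat + me

# Pravachol event proportion from the earlier example: about (0.045, 0.060)
print(ci_prop(174 / 3302, 3302))
```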
Confidence Interval for a Mean μ
• Confidence Coefficient (C): Probability (based on
repeated samples and construction of intervals) that a
confidence interval will contain the true mean μ
• Common choices of C and resulting intervals:
90% Confidence: $\bar{x} \pm 1.645\,\frac{\sigma}{\sqrt{n}}$

95% Confidence: $\bar{x} \pm 1.960\,\frac{\sigma}{\sqrt{n}}$

99% Confidence: $\bar{x} \pm 2.576\,\frac{\sigma}{\sqrt{n}}$

C% Confidence: $\bar{x} \pm z^*\,\frac{\sigma}{\sqrt{n}}$

C      z*
90%    1.645
95%    1.960
99%    2.576
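Any other confidence level follows the same pattern; a one-liner sketch for looking up z* with Python's standard library:

```python
from statistics import NormalDist

def z_star(C):
    """Critical value leaving central area C under the standard normal."""
    return NormalDist().inv_cdf(1 - (1 - C) / 2)

for C in (0.90, 0.95, 0.99):
    print(C, round(z_star(C), 3))   # 1.645, 1.96, 2.576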
[Figure: Normal distribution of $\bar{X}$, centered at μ, with central area C between $\mu - z^*\sigma/\sqrt{n}$ and $\mu + z^*\sigma/\sqrt{n}$ and area (1−C)/2 in each tail. The corresponding standard normal distribution has area C between −z* and z*, with (1−C)/2 in each tail.]
Philadelphia Monthly Rainfall (1825-1869)
[Histogram of the monthly rainfall amounts (inches), with bins from 1 to 15 and "More"]

$\mu = 3.68 \quad \sigma = 1.92 \qquad \text{Margin of error } (n = 20,\ C = 95\%):\ 1.96\frac{1.92}{\sqrt{20}} = 0.84$
4 Random Samples of Size n=20, 95% CI’s
[Table: for each of the 4 samples, the 20 randomly selected months and their rainfall amounts (inches); the random numbers used to draw the months are omitted here]

Sample   Mean   Mean − m.e.   Mean + m.e.
1        3.39   2.55          4.23
2        3.88   3.04          4.72
3        3.86   3.02          4.70
4        3.15   2.31          3.99

Margin of error (n = 20, C = 95%): $1.96\frac{1.92}{\sqrt{20}} = 0.84$. All four intervals contain the true mean μ = 3.68.
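A minimal simulation sketch of this repeated-sampling idea, assuming for simplicity that monthly rainfall is normal with μ = 3.68 and σ = 1.92 (the real data are only approximately so):

```python
import random
from math import sqrt

random.seed(1)
mu, sigma, n, z = 3.68, 1.92, 20, 1.96
me = z * sigma / sqrt(n)                    # 0.84, as computed above

reps, covered = 10_000, 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if xbar - me <= mu <= xbar + me:        # does the interval catch mu?
        covered += 1
print(covered / reps)                       # close to 0.95
```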
Factors Affecting Confidence Interval Width
• Goal: Have precise (narrow) confidence intervals
– Confidence Level (C): Increasing C increases the probability that an interval contains the parameter, which widens the confidence interval. Reducing C shortens the interval (at a cost in confidence)
– Sample size (n): Increasing n decreases the standard error of the estimate, the margin of error, and the width of the interval (quadrupling n cuts the width in half)
– Standard Deviation (σ): The more variable the individual measurements, the wider the interval. Potential ways to reduce σ are to focus on a more precise target population or to use a more precise measuring instrument. Often nothing can be done, as nature determines σ
Selecting the Sample Size
• Before collecting sample data, usually have a goal for how large the margin of error should be to have a useful estimate of the unknown parameter (particularly when comparing two populations)
• Let m be the desired level of the margin of error and σ be the standard deviation of the population of measurements (typically unknown and must be estimated from previous research or a pilot study)
• The sample size giving this margin of error is:

$m = z^*\frac{\sigma}{\sqrt{n}} \;\Rightarrow\; n = \left(\frac{z^*\sigma}{m}\right)^2$
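A short sketch of this calculation (rounding up, since n must be an integer); the rainfall numbers are reused purely for illustration:

```python
from math import ceil
from statistics import NormalDist

def n_for_margin(m, sigma, C=0.95):
    """Smallest n with z* sigma / sqrt(n) <= m."""
    z = NormalDist().inv_cdf(1 - (1 - C) / 2)
    return ceil((z * sigma / m) ** 2)

# Halving the rainfall example's 0.84 margin (sigma = 1.92) roughly
# quadruples the required sample size: prints 81 (vs n = 20 before)
print(n_for_margin(0.42, 1.92))
```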
Precautions
• Data should be simple random sample from population
(or at least can be treated as independent observations)
• More complex sampling designs have adjustments
made to formulas (see Texts such as Elementary Survey
Sampling by Scheaffer, Mendenhall, Ott)
• Biased sampling designs give meaningless results
• Small sample sizes from nonnormal distributions will
have coverage probabilities (C) typically below the
nominal level
• Typically σ is unknown. Replacing it with the sample standard deviation s works as a good approximation in large samples
Significance Tests
• Method of using sample (observed) data to challenge a
hypothesis regarding a state of nature (represented as
particular parameter value(s))
• Begin by stating a research hypothesis that challenges a
statement of “status quo” (or equality of 2 populations)
• State the current state or “status quo” as a statement
regarding population parameter(s)
• Obtain sample data and see to what extent it
agrees/disagrees with the “status quo”
• Conclude that the “status quo” is not true if observed
data are highly unlikely (low probability) if it were true
Pravachol and Olestra
• Pravachol vs Placebo wrt heart disease/death
– Pravachol: 5.27% of 3302 patients suffered MI or death due to CHD
– Placebo: 7.53% of 3293 patients suffered MI or death due to CHD
– Probability of a difference this large for Pravachol, if it were no more effective than placebo, is .000088 (will learn formula later)
• Olestra vs Triglyceride Chips wrt GI Symptoms
– Olestra: 15.81% of 563 subjects report GI symptoms
– Triglyceride: 17.58% of 529 subjects report GI symptoms
– Probability of difference this large in either direction (olestra
better or worse) is .4354
• Strong evidence of Pravachol effect vs placebo
• Weak to no evidence of Olestra effect vs Triglyceride
Elements of a Significance Test
• Null hypothesis (H0): Statement or theory being tested. Will be
stated in terms of parameters and contain an equality. Test is set
up under the assumption of its truth.
• Alternative Hypothesis (Ha): Statement contradicting H0. Will be
stated in terms of parameters and contain an inequality. Will only
be accepted if strong evidence refutes H0 based on sample data.
May be 1-sided or 2-sided, depending on theory being tested.
• Test Statistic (TS): Quantity measuring discrepancy between
sample statistic (estimate) and parameter value under H0
• P-value: Probability (assuming H0 true) that we would observe
sample data (test statistic) this extreme or more extreme in favor
of the alternative hypothesis (Ha)
Example: Interference Effect
• Does the way items are presented affect task time?
– Subjects shown a list of color names printed in 2 colors: different/black
– Xi is the difference in times to read the lists for subject i: diff − blk
– H₀: No interference effect: mean difference is 0 (μ = 0)
– Hₐ: Interference effect exists: mean difference > 0 (μ > 0)
– Assume the standard deviation of the differences is σ = 8 (unrealistic*)
– Experiment to be based on n = 70 subjects

Parameter value under H₀: μ = 0

Approximate distribution of the sample mean under H₀: $\bar{X} \sim N\!\left(0,\ \frac{\sigma}{\sqrt{n}} = \frac{8}{\sqrt{70}} = 0.96\right)$

Observed sample mean: $\bar{x} = 2.39$

How likely are we to observe a sample mean difference ≥ 2.39 if μ = 0?
[Figure: sampling distribution of $\bar{X}$ under H₀, centered at 0; the P-value is the shaded upper-tail area to the right of the observed mean 2.39]
Computing the P-Value
• 2-sided Tests: How likely is it to observe a sample mean as far or farther from the parameter value under the null hypothesis? (H₀: μ = μ₀  Hₐ: μ ≠ μ₀)

Under H₀: $\bar{X} \sim N\!\left(\mu_0,\ \frac{\sigma}{\sqrt{n}}\right) \;\Rightarrow\; Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1)$

After obtaining the sample data, compute the mean, convert it to a z-score (z_obs), and find the area above |z_obs| and below −|z_obs| from the standard normal (z) table
• 1-sided Tests: Obtain the area above z_obs for upper-tail tests (Hₐ: μ > μ₀) or below z_obs for lower-tail tests (Hₐ: μ < μ₀)
Interference Effect (1-sided Test)
• Testing whether the population mean time to read a list of colors is higher when the color names are written in a different color
• Data: Xi: difference score for subject i (Different − Black)
• Null hypothesis (H₀): No interference effect (μ = 0)
• Alternative hypothesis (Hₐ): Interference effect (μ > 0)
• “Known”: n = 70, σ = 8 (This won’t be known in practice but can be replaced by the sample s.d. for large samples)
Sample data: $\bar{x} = 2.39 \quad s = 7.81 \quad n = 70$

Test statistic (based on σ = 8): $z_{obs} = \frac{2.39 - 0}{8/\sqrt{70}} = \frac{2.39}{0.96} = 2.49$

Test statistic (based on s = 7.81): $z_{obs} = \frac{2.39 - 0}{7.81/\sqrt{70}} = \frac{2.39}{0.93} = 2.57$

P-value (based on σ = 8): $P(Z \ge 2.49) = 1 - .9936 = .0064$

P-value (based on s = 7.81): $P(Z \ge 2.57) = 1 - .9949 = .0051$
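A minimal sketch of this upper-tail z test in Python (exact normal areas rather than table lookups, so the last digits differ slightly from the rounded values above):

```python
from math import sqrt
from statistics import NormalDist

def z_test_upper(xbar, mu0, sd, n):
    """Upper-tail z test: returns (z_obs, P-value) for Ha: mu > mu0."""
    z = (xbar - mu0) / (sd / sqrt(n))
    return z, 1 - NormalDist().cdf(z)

print(z_test_upper(2.39, 0, 8.00, 70))   # about (2.50, .0062)
print(z_test_upper(2.39, 0, 7.81, 70))   # about (2.56, .0052)
```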
Interference Effect (2-sided Test)
• Testing whether the population mean time to read a list of colors is affected (higher or lower) when the color names are written in a different color
• Data: Xi: difference score for subject i (Different − Black)
• Null hypothesis (H₀): No interference effect (μ = 0)
• Alternative hypothesis (Hₐ): Interference effect (+ or −) (μ ≠ 0)
• “Known”: n = 70, σ = 8 (This won’t be known in practice but can be replaced by the sample s.d. for large samples)
Sample data: $\bar{x} = 2.39 \quad s = 7.81 \quad n = 70$

Test statistic (based on σ = 8): $z_{obs} = \frac{2.39 - 0}{8/\sqrt{70}} = \frac{2.39}{0.96} = 2.49$

Test statistic (based on s = 7.81): $z_{obs} = \frac{2.39 - 0}{7.81/\sqrt{70}} = \frac{2.39}{0.93} = 2.57$

P-value (based on σ = 8): $2P(Z \ge |2.49|) = 2(1 - .9936) = .0128$

P-value (based on s = 7.81): $2P(Z \ge |2.57|) = 2(1 - .9949) = .0102$
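The only change from the 1-sided sketch is doubling the tail area beyond |z_obs|:

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(xbar, mu0, sd, n):
    """Two-sided z test: returns (z_obs, P-value) for Ha: mu != mu0."""
    z = (xbar - mu0) / (sd / sqrt(n))
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

print(z_test_two_sided(2.39, 0, 8.00, 70))   # about (2.50, .0124)
```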
Equivalence of 2-sided Tests and CI’s
• For α = 1 − C, a 2-sided test conducted at significance level α will give results equivalent to a C-level confidence interval:
– If the entire interval > μ₀: P-value < α, z_obs > 0 (conclude μ > μ₀)
– If the entire interval < μ₀: P-value < α, z_obs < 0 (conclude μ < μ₀)
– If the interval contains μ₀: P-value > α (don’t conclude μ ≠ μ₀)
• The confidence interval is the set of parameter values for which we would fail to reject the null hypothesis (based on a 2-sided test)
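A quick numeric check of this equivalence with the interference-effect numbers (α = 0.05, C = 95%):

```python
from math import sqrt
from statistics import NormalDist

xbar, mu0, sd, n, alpha = 2.39, 0, 8.0, 70, 0.05
se = sd / sqrt(n)
z_star = NormalDist().inv_cdf(1 - alpha / 2)       # 1.96

lo, hi = xbar - z_star * se, xbar + z_star * se    # 95% CI: (0.52, 4.26)
p = 2 * (1 - NormalDist().cdf(abs((xbar - mu0) / se)))

# Both print False here: 0 is outside the interval and P-value < alpha
print(lo <= mu0 <= hi, p > alpha)
```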
Decision Rules and Critical Values
• Once a significance level (α) has been chosen, a decision rule can be stated, based on a critical value:
• 2-sided tests: H₀: μ = μ₀  Hₐ: μ ≠ μ₀
– If test statistic z_obs > z_{α/2}: Reject H₀ and conclude μ > μ₀
– If test statistic z_obs < −z_{α/2}: Reject H₀ and conclude μ < μ₀
– If −z_{α/2} < z_obs < z_{α/2}: Do not reject H₀: μ = μ₀
• 1-sided tests (Upper Tail): H₀: μ = μ₀  Hₐ: μ > μ₀
– If test statistic z_obs > z_α: Reject H₀ and conclude μ > μ₀
– If z_obs < z_α: Do not reject H₀: μ = μ₀
• 1-sided tests (Lower Tail): H₀: μ = μ₀  Hₐ: μ < μ₀
– If test statistic z_obs < −z_α: Reject H₀ and conclude μ < μ₀
– If z_obs > −z_α: Do not reject H₀: μ = μ₀
Potential for Abuse of Tests
• Should choose a significance (α) level in advance
and report test conclusion (significant/nonsignificant)
as well as the P-value. Significance level of 0.05 is
widely used in the academic literature
• Very large sample sizes can detect very small
differences for a parameter value. A clinically
meaningful effect should be determined, and
confidence interval reported when possible
• A nonsignificant test result does not imply no effect
(that H0 is true).
• Many studies test many variables simultaneously.
This can increase overall type I error rates
Large-Sample Test H₀: μ₁ − μ₂ = 0 vs Hₐ: μ₁ − μ₂ > 0
• H₀: μ₁ − μ₂ = 0 (No difference in population means)
• Hₐ: μ₁ − μ₂ > 0 (Population Mean 1 > Pop Mean 2)

$T.S.:\ z_{obs} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \qquad R.R.:\ z_{obs} \ge z_\alpha \qquad P\text{-value}:\ P(Z \ge z_{obs})$

• Conclusion - Reject H₀ if the test statistic falls in the rejection region, or equivalently if the P-value is ≤ α
Example - Botox for Cervical Dystonia
• Patients - Individuals suffering from cervical dystonia
• Response - Tsui score of severity of cervical dystonia
(higher scores are more severe) at week 8 of Tx
• Research (alternative) hypothesis - Botox A decreases
mean Tsui score more than placebo
• Groups - Placebo (Group 1) and Botox A (Group 2)
• Experimental (Sample) Results:
$\bar{x}_1 = 10.1 \quad s_1 = 3.6 \quad n_1 = 33 \qquad \bar{x}_2 = 7.7 \quad s_2 = 3.4 \quad n_2 = 35$
Source: Wissel, et al (2001)
Example - Botox for Cervical Dystonia
Test whether Botox A produces lower mean Tsui scores than placebo (α = 0.05)

$H_0:\ \mu_1 - \mu_2 = 0 \qquad H_A:\ \mu_1 - \mu_2 > 0$

$T.S.:\ z_{obs} = \frac{10.1 - 7.7}{\sqrt{\frac{(3.6)^2}{33} + \frac{(3.4)^2}{35}}} = \frac{2.4}{0.85} = 2.82$

$R.R.:\ z_{obs} \ge z_\alpha = z_{.05} = 1.645 \qquad P\text{-val}:\ P(Z \ge 2.82) = .0024$
Conclusion: Botox A produces lower mean Tsui scores than
placebo (since 2.82 > 1.645 and P-value < 0.05)
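A minimal sketch of this large-sample two-mean test, using the Botox figures as a check:

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x1, s1, n1, x2, s2, n2):
    """Large-sample test of H0: mu1 - mu2 = 0 vs Ha: mu1 - mu2 > 0."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)
    z = (x1 - x2) / se
    return z, 1 - NormalDist().cdf(z)

# Placebo (group 1) vs Botox A (group 2) Tsui scores: about (2.82, .0024)
print(two_sample_z(10.1, 3.6, 33, 7.7, 3.4, 35))
```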
2-Sided Tests
• Many studies don’t assume a direction wrt the difference μ₁ − μ₂
• H₀: μ₁ − μ₂ = 0  Hₐ: μ₁ − μ₂ ≠ 0
• Test statistic is the same as before
• Decision Rule:
– Conclude μ₁ − μ₂ > 0 if z_obs ≥ z_{α/2} (α = 0.05 ⇒ z_{α/2} = 1.96)
– Conclude μ₁ − μ₂ < 0 if z_obs ≤ −z_{α/2} (α = 0.05 ⇒ −z_{α/2} = −1.96)
– Do not reject μ₁ − μ₂ = 0 if −z_{α/2} ≤ z_obs ≤ z_{α/2}
• P-value: $2P(Z \ge |z_{obs}|)$
Power of a Test
• Power - Probability a test rejects H₀ (depends on μ₁ − μ₂)
– H₀ True: Power = P(Type I error) = α
– H₀ False: Power = 1 − P(Type II error) = 1 − β
• Example:
– H₀: μ₁ − μ₂ = 0  Hₐ: μ₁ − μ₂ > 0
– σ₁² = σ₂² = 25, n₁ = n₂ = 25
– Decision Rule: Reject H₀ (at α = 0.05 significance level) if:

$z_{obs} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{2}} \ge 1.645 \;\Rightarrow\; \bar{x}_1 - \bar{x}_2 \ge 2.326$
Power of a Test
• Now suppose in reality that μ₁ − μ₂ = 3.0 (Hₐ is true)
• Power now refers to the probability we (correctly)
reject the null hypothesis. Note that the sampling
distribution of the difference in sample means is
approximately normal, with mean 3.0 and standard
deviation (standard error) 1.414.
• Decision Rule (from last slide): Conclude population
means differ if the sample mean for group 1 is at least
2.326 higher than the sample mean for group 2
• Power for this case can be computed as:
$P(\bar{X}_1 - \bar{X}_2 \ge 2.326) \quad \text{where} \quad \bar{X}_1 - \bar{X}_2 \sim N(3,\ \sqrt{2.0} = 1.414)$
Power of a Test
2.326  3
Power  P( X 1  X 2  2.326)  P( Z 
 0.48)  .6844
1.41
• All else being equal:
– As sample sizes increase, power increases
– As population variances decrease, power increases
– As the true mean difference increases, power increases
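A minimal sketch of this power computation, with the decision cutoff derived rather than hard-coded:

```python
from math import sqrt
from statistics import NormalDist

var, n, alpha = 25, 25, 0.05
se = sqrt(var / n + var / n)                      # 1.414
cutoff = NormalDist().inv_cdf(1 - alpha) * se     # reject if diff >= 2.326

true_diff = 3.0                                   # scenario where Ha holds
power = 1 - NormalDist(true_diff, se).cdf(cutoff)
print(round(power, 4))                            # about 0.683 (slide rounds to .6844)
```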
Power of a Test
[Figure: sampling distributions of $\bar{X}_1 - \bar{X}_2$ under H₀ and under Hₐ; power is the area under the Hₐ curve beyond the critical value]
Power of a Test
[Figure: power curves for group sample sizes of 25, 50, 75, 100 over varying true values of μ₁ − μ₂, with σ₁ = σ₂ = 5]
• For given μ₁ − μ₂, power increases with sample size
• For given sample size, power increases with μ₁ − μ₂
Sample Size Calculations for Fixed Power
• Goal - Choose sample sizes to have a favorable chance of detecting a clinically meaningful difference
• Step 1 - Define an important difference in means:
– Case 1: σ approximated from prior experience or a pilot study - the difference can be stated in units of the data
– Case 2: σ unknown - the difference must be stated in units of standard deviations of the data:

$\Delta = \frac{\mu_1 - \mu_2}{\sigma}$

• Step 2 - Choose the desired power to detect the clinically meaningful difference (1 − β, typically at least .80). For a 2-sided test:

$n_1 = n_2 = \frac{2(z_{\alpha/2} + z_\beta)^2}{\Delta^2}$