Stat 430
Fall 2011
•
Hypothesis Testing:
• large sample vs small sample
• two sample comparisons
• multiple testing
•
Intro to Linear Models
•
State null hypothesis
•
State alternative hypothesis
•
State test statistic
•
Find distribution of test statistic under null hypothesis
•
Compute p-value
•
Draw conclusion: reject null hypothesis (or fail to reject)
•
Type I error
P(reject $H_0$ | $H_0$ is true) = alpha
•
Type II error
P(accept $H_0$ | $H_0$ is false) = beta
                  accept $H_0$      reject $H_0$
$H_0$ true                          Type I
$H_0$ false       Type II
• trying to decrease type I error will usually increase type II and vice versa
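The trade-off between the two error types can be illustrated with a small simulation. This is only a sketch under assumed settings that are not part of the slides: a one-sided z-test of $H_0$: mu = 0 with known sigma = 1, samples of size 25, and a true mean of 0.5 whenever $H_0$ is false. The function name `rejection_rate` is made up for this illustration.

```python
import math
import random

def rejection_rate(true_mu, z_crit, mu0=0.0, sigma=1.0, n=25, reps=5000, seed=1):
    """Share of simulated samples in which H0: mu = mu0 is rejected
    in favour of Ha: mu > mu0 by a one-sided z-test with known sigma."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    rejections = 0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(true_mu, sigma) for _ in range(n)) / n
        z = (sample_mean - mu0) / (sigma / math.sqrt(n))
        if z > z_crit:
            rejections += 1
    return rejections / reps

results = {}
for z_crit in (1.645, 2.326):  # roughly alpha = 0.05 and alpha = 0.01
    type_one = rejection_rate(true_mu=0.0, z_crit=z_crit)      # H0 is true
    type_two = 1 - rejection_rate(true_mu=0.5, z_crit=z_crit)  # H0 is false
    results[z_crit] = (type_one, type_two)
    print(f"z_crit={z_crit}: type I ~ {type_one:.3f}, type II ~ {type_two:.3f}")
```

With the stricter critical value the estimated Type I rate goes down while the estimated Type II rate goes up, which is the behaviour described in the bullet above.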
•
The probability of a Type I error, alpha, is called the significance level
•
Power of a test: P(reject $H_0$ | $H_0$ is false)
• acceptance/rejection region: all observed values that will lead us to not reject/reject the null hypothesis
•
Test statistic: quantity that reflects how close the data are to the null hypothesis; this quantity needs to have a known distribution under the null hypothesis
•
P-value: probability of observing the value of the test statistic that we got (or a more extreme value), assuming that the null hypothesis is true.
•
Reject $H_0$ if the P-value is “too small” (usually below 5%)
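As a rough illustration of how a P-value is obtained once the test statistic has a known distribution, the sketch below assumes a test statistic that is standard normal under the null hypothesis. The function names `normal_upper_tail` and `p_value` are invented for this example; only `math.erfc` from the Python standard library is used.

```python
import math

def normal_upper_tail(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def p_value(z, alternative="greater"):
    """P-value for an observed statistic z that is N(0, 1) under H0."""
    if alternative == "greater":   # Ha: parameter is larger than under H0
        return normal_upper_tail(z)
    if alternative == "less":      # Ha: parameter is smaller than under H0
        return normal_upper_tail(-z)
    return 2 * normal_upper_tail(abs(z))  # two-sided alternative

z_observed = 2.1  # hypothetical observed value of the test statistic
p = p_value(z_observed, "greater")
print(p, "reject H0" if p < 0.05 else "fail to reject H0")
```

"More extreme" depends on the alternative hypothesis, which is why the direction is passed in explicitly.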
Example: Gene Expression
•
Expression values for gene 244912_at
experimental condition: 8.18 8.16 7.95
controls: 8.64 8.28 8.7 8.34 8.41 8.35 8.42 8.11 8.55
•
Our assumption: gene 244912_at is underexpressed due to the treatment in the experiment (we would like to show that)
•
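$H_0$: $\mu_S = \mu_C$ (same mean expression under the experimental condition S and in the controls C)
$H_a$: $\mu_S < \mu_C$ (the gene is underexpressed under the experimental condition)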
•
Test statistic: compare mean expression values
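• controls: $\text{mean}_C = 8.42$, $\text{sd}_C = 0.18$ (9 values)
• experimental condition: $\text{mean}_S = 8.10$, $\text{sd}_S = 0.13$ (3 values)
•
$T = (8.42 - 8.10) / \sqrt{0.13^2/3 + 0.18^2/9} = 3.3$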
It will turn out that we reject the null hypothesis in favor of the alternative hypothesis.
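The summary numbers of this example can be recomputed directly from the listed expression values. The sketch below uses only the Python standard library; the variable names are chosen for this illustration. Because it works with unrounded means and standard deviations, the statistic comes out near 3.4 rather than the 3.3 obtained from the rounded inputs.

```python
import math
import statistics

# Expression values for gene 244912_at as listed in the example
study = [8.18, 8.16, 7.95]
control = [8.64, 8.28, 8.7, 8.34, 8.41, 8.35, 8.42, 8.11, 8.55]

mean_s, sd_s = statistics.mean(study), statistics.stdev(study)
mean_c, sd_c = statistics.mean(control), statistics.stdev(control)

# Standard error of the difference in means (variances not pooled)
se = math.sqrt(sd_s**2 / len(study) + sd_c**2 / len(control))

# Control minus study, so a positive value points towards underexpression
t_stat = (mean_c - mean_s) / se

print(f"study:   mean={mean_s:.2f} sd={sd_s:.2f}")
print(f"control: mean={mean_c:.2f} sd={sd_c:.2f}")
print(f"T = {t_stat:.2f}")
```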
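• $H_0$: $p = p_0$, test statistic $(\hat{p} - p_0) / \sqrt{\hat{p}(1 - \hat{p})/n}$
• $H_0$: $\mu = \mu_0$, test statistic $(\bar{x} - \mu_0) / (s/\sqrt{n})$
• $H_0$: $\mu_1 - \mu_2 = d$, test statistic $(\bar{x}_1 - \bar{x}_2 - d) / \sqrt{s_1^2/n_1 + s_2^2/n_2}$
• $H_0$: $p_1 - p_2 = d$, test statistic $(\hat{p}_1 - \hat{p}_2 - d) / \sqrt{\hat{p}_1(1 - \hat{p}_1)/n_1 + \hat{p}_2(1 - \hat{p}_2)/n_2}$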
All these test statistics have a large sample standard normal distribution
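The four large sample test statistics can be written as small functions. This is a sketch, not a library API: the function and argument names are invented here, and the proportion tests are assumed to use the estimated proportions in the standard error.

```python
import math

def z_one_proportion(p_hat, p0, n):
    """H0: p = p0."""
    return (p_hat - p0) / math.sqrt(p_hat * (1 - p_hat) / n)

def z_one_mean(x_bar, mu0, s, n):
    """H0: mu = mu0."""
    return (x_bar - mu0) / (s / math.sqrt(n))

def z_two_means(x_bar1, x_bar2, d, s1, s2, n1, n2):
    """H0: mu1 - mu2 = d."""
    return (x_bar1 - x_bar2 - d) / math.sqrt(s1**2 / n1 + s2**2 / n2)

def z_two_proportions(p_hat1, p_hat2, d, n1, n2):
    """H0: p1 - p2 = d."""
    se = math.sqrt(p_hat1 * (1 - p_hat1) / n1 + p_hat2 * (1 - p_hat2) / n2)
    return (p_hat1 - p_hat2 - d) / se

# Hypothetical numbers: sample mean 10.5, H0 mean 10, s = 2, n = 64
print(z_one_mean(10.5, 10, 2, 64))
```

Each value is then compared with the standard normal distribution to get a P-value.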
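• $H_0$: $p = p_0$, test statistic $(\hat{p} - p_0) / \sqrt{\hat{p}(1 - \hat{p})/n}$
• $H_0$: $\mu = \mu_0$, test statistic $(\bar{x} - \mu_0) / (s/\sqrt{n})$
• $H_0$: $\mu_1 - \mu_2 = d$, test statistic $(\bar{x}_1 - \bar{x}_2 - d) / \sqrt{s_1^2/n_1 + s_2^2/n_2}$
• $H_0$: $p_1 - p_2 = d$, test statistic $(\hat{p}_1 - \hat{p}_2 - d) / \sqrt{\hat{p}_1(1 - \hat{p}_1)/n_1 + \hat{p}_2(1 - \hat{p}_2)/n_2}$
distr.: $t_{n-1}$ for the test of $\mu = \mu_0$; the entries marked *** need some more work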
•
Situation I: samples 1 and 2 are paired, i.e. values in sample 1 correspond to a ‘before’ measurement, values in sample 2 correspond to an ‘after’ measurement on the same individual/experimental unit
•
Situation I: paired samples. Obtain new data as the differences between the paired values; that way we are in a single sample testing situation
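A minimal sketch of the paired situation: take the differences and treat them as a single sample. The before/after values below are made up for illustration and do not come from the course data.

```python
import math
import statistics

# Hypothetical paired measurements on the same four experimental units
before = [10, 12, 9, 11]
after = [12, 13, 11, 12]

# One difference per unit turns the problem into a one-sample test of H0: mean difference = 0
diffs = [a - b for b, a in zip(before, after)]

n = len(diffs)
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
df = n - 1  # the statistic is compared with a t distribution with n - 1 df

print(diffs, t_stat, df)
```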
•
Situation II: independent samples, assume same variance
• pooled variance: $s^2 = [(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2] / (n_1 + n_2 - 2)$
• test statistic has t distribution with $(n_1 + n_2 - 2)$ degrees of freedom
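The pooled variance can be sketched in a few lines. The helper name `pooled_variance` is invented here. The example reuses the gene 244912_at values purely as an illustration, which assumes equal variances in both groups; the form of the pooled test statistic is the usual textbook one and is not spelled out on the slide.

```python
import math
import statistics

def pooled_variance(s1, n1, s2, n2):
    """Weighted average of the two sample variances, weights n_i - 1."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

study = [8.18, 8.16, 7.95]
control = [8.64, 8.28, 8.7, 8.34, 8.41, 8.35, 8.42, 8.11, 8.55]
n1, n2 = len(study), len(control)

s2_pooled = pooled_variance(statistics.stdev(study), n1, statistics.stdev(control), n2)
df = n1 + n2 - 2

# Usual pooled two-sample statistic for H0: mu_C - mu_S = 0
t_stat = (statistics.mean(control) - statistics.mean(study)) / math.sqrt(
    s2_pooled * (1 / n1 + 1 / n2)
)
print(s2_pooled, df, t_stat)
```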
•
Situation III: independent samples, variances not assumed equal
• variance of difference: $s^2 = s_1^2/n_1 + s_2^2/n_2$
• test statistic has t distribution with $k$ df
•
Welch-Satterthwaite: $k = (s_1^2/n_1 + s_2^2/n_2)^2 / [ (s_1^2/n_1)^2/(n_1 - 1) + (s_2^2/n_2)^2/(n_2 - 1) ]$
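The Welch-Satterthwaite degrees of freedom are easy to get wrong by hand, so a short sketch may help. It uses the standard form of the approximation, in which each sample variance is divided by its sample size. The helper name `welch_df` is made up, and the gene 244912_at values are reused only as example input.

```python
import statistics

def welch_df(s1, n1, s2, n2):
    """Approximate degrees of freedom k for the unequal-variance t-test."""
    v1 = s1**2 / n1  # estimated variance of the first sample mean
    v2 = s2**2 / n2  # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

study = [8.18, 8.16, 7.95]
control = [8.64, 8.28, 8.7, 8.34, 8.41, 8.35, 8.42, 8.11, 8.55]

k = welch_df(statistics.stdev(study), len(study),
             statistics.stdev(control), len(control))
print(k)  # generally not an integer
```

The result always lies between the smaller of $n_1 - 1$ and $n_2 - 1$ and the pooled value $n_1 + n_2 - 2$.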
• Confidence interval for $\mu$: $(\bar{x} - t \cdot s/\sqrt{n},\ \bar{x} + t \cdot s/\sqrt{n})$
• Confidence interval for $p_1 - p_2$: $\hat{p}_1 - \hat{p}_2 \pm z \cdot \sqrt{\hat{p}_1(1 - \hat{p}_1)/n_1 + \hat{p}_2(1 - \hat{p}_2)/n_2}$
•
In the gene expression example there are 22,810 genes
•
5,628 show a significant difference in expression between study and control (at a .05 level)
• not practicable!
•
Bonferroni Adjustment
•
False Discovery Rate
•
Graphics
•
Lower significance level according to the number of tests:
•
P-values have to be less than alpha/n instead of less than alpha, where n is the number of tests
•
Gene expression: 94 genes are significant at the 0.05/22,810 level, which is very conservative!
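The Bonferroni adjustment amounts to one division. The sketch below shows the cutoff for the gene expression example and a small helper that flags P-values below the adjusted level. The helper name and the three example P-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each P-value that is below alpha divided by the number of tests."""
    cutoff = alpha / len(p_values)
    return [p < cutoff for p in p_values]

# Adjusted level for the gene expression example: 22,810 tests at alpha = 0.05
n_tests = 22810
cutoff = 0.05 / n_tests
print(f"adjusted cutoff: {cutoff:.3g}")

# Hypothetical P-values from three tests, so the cutoff is 0.05 / 3
print(bonferroni_significant([0.001, 0.02, 0.04]))
```

A P-value has to be smaller than roughly two in a million to pass in the gene example, which shows why so few genes remain.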
[Figure: expression values (axis ticks 6 to 12) by variable, for samples S11, S12, S13, S21, S22, S23, S31, S32, S33 and C1, C2, C3]
•
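Type I error rate: share of tests that give a false positive result when the null hypothesis is true
• for n=22,810 we’d expect 0.05 × 22,810 = 1140.5 false positives at 5% significance level
• we got 5,628 significant results, therefore about 1140.5/5628 = 20.3% of them are false positives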
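The arithmetic behind these numbers, written out as a short check. The variable names are chosen for this illustration, and the calculation treats every test as if its null hypothesis could be true, as the bullet does.

```python
n_tests = 22810        # genes tested
alpha = 0.05           # significance level of each single test
n_significant = 5628   # genes reported as significant

expected_false_positives = alpha * n_tests
share_false = expected_false_positives / n_significant

print(expected_false_positives)                      # expected number of false positives
print(f"{share_false:.1%} of the significant results")
```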
•
Idea: control the false discovery rate
•
Pick cutoff of significance level such that the false discovery rate is under a threshold
• gene expression:
alpha        FDR
0.05         20.3%
0.025        13.4%
0.01         7.7%
0.009        5.1%
0.0049956    5.0% (2,279 results)
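One way to read the table: for each cutoff alpha, the estimated false discovery rate is alpha times the number of tests, divided by the number of results significant at that cutoff. The sketch below follows that reading. The function names and the artificial P-values are assumptions made for illustration, and this is not necessarily the exact procedure used to produce the table.

```python
def estimated_fdr(p_values, alpha):
    """Expected false positives at cutoff alpha over the observed significant results."""
    n_significant = sum(p < alpha for p in p_values)
    if n_significant == 0:
        return 0.0
    return alpha * len(p_values) / n_significant

def largest_cutoff(p_values, candidates, max_fdr=0.05):
    """Largest candidate cutoff whose estimated FDR stays at or below max_fdr."""
    acceptable = [a for a in candidates if estimated_fdr(p_values, a) <= max_fdr]
    return max(acceptable) if acceptable else None

# Artificial data: 900 evenly spread "null" P-values plus 100 very small ones
p_values = [(i + 1) / 901 for i in range(900)] + [1e-4] * 100

for a in (0.05, 0.025, 0.01, 0.004, 0.001):
    print(a, round(estimated_fdr(p_values, a), 3))
print("chosen cutoff:", largest_cutoff(p_values, [0.05, 0.025, 0.01, 0.004, 0.001]))

# Same formula applied to counts from the gene example
print(0.0049956 * 22810 / 2279)  # close to 0.05
```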
[Figures: expression values by variable; panels labelled -1 and 1 with samples S1, S2 and C1 to C6, and panels with samples S11 to S33 and C1 to C3]
On average, 5% of the results are false positives
• Graphics might help to distinguish between “interesting” and not so interesting results from the other two methods
• problem: we need to be creative - graphics are not an out of the box solution
•
Here: separate out those genes that have a very low overall variance, i.e. “flat-liners”
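Separating out “flat-liners” could look roughly like the sketch below. The gene names, expression values and variance threshold are invented for illustration; the slides do not say which threshold was used.

```python
import statistics

def split_flat_liners(expression, min_variance):
    """Split genes into those with very low overall variance and the rest.

    expression maps a gene name to its expression values across all samples.
    """
    flat = {gene: values for gene, values in expression.items()
            if statistics.variance(values) < min_variance}
    varying = {gene: values for gene, values in expression.items()
               if gene not in flat}
    return flat, varying

# Hypothetical expression values for two genes
expression = {
    "flat_gene": [7.0, 7.01, 6.99, 7.0],
    "varying_gene": [6.0, 9.0, 7.5, 11.0],
}
flat, varying = split_flat_liners(expression, min_variance=0.01)
print(sorted(flat), sorted(varying))
```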
[Figure: expression values (axis ticks 6 to 12) by variable, for samples S11, S12, S13, S21, S22, S23, S31, S32, S33 and C1, C2, C3]
Linear Models
•
Response (or dependent) variable $Y$
•
Explanatory variables (or covariates) $X_1, X_2, \ldots, X_p$
•
Try to find a relationship $f$ that helps to determine outcome $Y$ based on values of $X_1, X_2, \ldots, X_p$:
$\hat{Y} = f(X_1, X_2, \ldots, X_p)$
Simple Linear Regression
•
Situation: we assume that f is a linear function, and p=1
• i.e. we want to find f() with f(x) = a + b*x that is “closest” to the values of Y; that is, we want to find values for a and b
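One common way to make “closest” precise is least squares: choose a and b so that the sum of squared differences between Y and a + b*x is as small as possible. The slides do not state which criterion is meant, so the sketch below is an assumption, and the data points are made up.

```python
import statistics

def least_squares_line(x, y):
    """Intercept a and slope b minimising the sum of (y_i - a - b * x_i)^2."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data lying exactly on the line y = 1 + 2x
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
a, b = least_squares_line(x, y)
print(a, b)
```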
•
Model
Y = a + bX + error
•
We get values for parameters a and b as a = -18.85, b = 0.013
• a is the intercept, i.e. the value for Y if X=0. In this data the interpretation is a bit obscure: for the year 0 we would expect the winner to jump 18.85 m backwards (quite a feat!)
• b is the average increase that we expect for Y when we increase X by 1 unit: for each year we expect the winner to jump 1.3 cm further; from one Olympics to the next we’d expect an increase of 5.2 cm
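The interpretation of a and b can be checked with a few lines of arithmetic. The sketch assumes that X is the calendar year and Y the winning distance in metres, as the interpretation suggests. The function name `predict` and the year 2000 are only illustrations.

```python
a = -18.85  # intercept: fitted value at year 0
b = 0.013   # slope: expected increase in metres per year

def predict(year):
    """Fitted value of the line Y = a + b * X at the given year."""
    return a + b * year

print(predict(0))      # the "backwards jump" at year 0
print(b * 100)         # expected increase per year, in cm
print(4 * b * 100)     # expected increase over four years (one Olympic cycle), in cm
print(predict(2000))   # fitted distance for a year within the usual range
```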
•
How do we get a and b?
•
How good is the model?
•
What are confidence intervals for the parameters a and b, for predicted values?