Linear Modelling I
Richard Mott
Wellcome Trust Centre for Human Genetics
Synopsis
• Linear Regression
• Correlation
• Analysis of Variance
• Principle of Least Squares
Correlation
Correlation and linear regression
• Is there a relationship?
• How do we summarise it?
• Can we predict new observations?
• What about outliers?
Correlation Coefficient r
• −1 ≤ r ≤ 1
• r = 1: perfect positive linear relationship
• r = 0: no linear relationship
• r = −1: perfect negative linear relationship
[example scatter plot with r = 0.6]
Examples of Correlation
(taken from Wikipedia)
Calculation of r
• Data: pairs (x₁, y₁), …, (xₙ, yₙ)
• r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )
Correlation in R
> cor(bioch$Biochem.Tot.Cholesterol,bioch$Biochem.HDL,use="complete")
[1] 0.2577617
> cor.test(bioch$Biochem.Tot.Cholesterol,bioch$Biochem.HDL,use="complete")
Pearson's product-moment correlation
data: bioch$Biochem.Tot.Cholesterol and bioch$Biochem.HDL
t = 11.1473, df = 1746, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2134566 0.3010088
sample estimates:
cor
0.2577617
> pt(11.1473,df=1746,lower.tail=FALSE)
[1] 3.154319e-28
# upper (one-sided) tail of the t distribution on 1746 degrees of freedom
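As an aside (not part of the original output), the t statistic reported by cor.test() can be recovered from r and the sample size alone; a minimal sketch, reusing the r and degrees of freedom printed above:
r <- 0.2577617
n <- 1748                                   # residual df 1746 plus 2
t <- r * sqrt((n - 2) / (1 - r^2))          # ~11.147, matching cor.test above
2 * pt(t, df = n - 2, lower.tail = FALSE)   # two-sided p-value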
Linear Regression
Fit a straight line to the data: yᵢ = a + b·xᵢ + eᵢ
• a: intercept
• b: slope
• eᵢ: error
  – Normally distributed
  – E(eᵢ) = 0
  – Var(eᵢ) = σ²
Example: simulated data
R code
> # simulate 30 data points
> x <- 1:30
> e <- rnorm(30, 0, 5)
> y <- 1 + 3*x + e
> # fit the linear model
> f <- lm(y ~ x)
> # plot the data and the predicted line
> plot(x,y)
> abline(reg=f)
> print(f)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
   -0.08634      3.04747
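Not shown on the slide, but often useful after lm(): summary() reports standard errors, t statistics and p-values for each coefficient, and coef() extracts just the estimates.
summary(f)    # coefficient table, residual standard error, R-squared
coef(f)       # the estimates of a (intercept) and b (slope)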
Least Squares
• Estimate a, b by least squares
• Minimise the sum of squared residuals between y and the prediction a + bx
• Minimise Σᵢ (yᵢ − a − b·xᵢ)² (see the sketch below)
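A minimal sketch of these estimates computed directly (the closed-form formulae referred to on the next slide), using the simulated x and y from the earlier example:
b.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a.hat <- mean(y) - b.hat * mean(x)
c(a.hat, b.hat)                      # should match coef(lm(y ~ x))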
Why least squares?
• LS gives simple formulae for the estimates of a, b
• If the errors are Normally distributed then the LS estimates are "optimal":
  – in large samples the estimates converge to the true values
  – no other estimates have smaller expected errors
  – LS = maximum likelihood
• Even if the errors are not Normal, LS estimates are often useful
Analysis of Variance (ANOVA)
LS estimates have an important property:
they partition the sum of squares (SS) into fitted and error components
Σᵢ (yᵢ − ȳ)² = Σᵢ (b̂(xᵢ − x̄))² + Σᵢ (yᵢ − b̂xᵢ − â)²
• total SS = fitting SS + residual SS
• only the LS estimates do this
Component     SS                          Degrees of   Mean Square
                                          freedom      (ratio of SS to df)
Fitting SS    Σᵢ (b̂(xᵢ − x̄))²             1            Σᵢ (b̂(xᵢ − x̄))²
Residual SS   Σᵢ (yᵢ − b̂xᵢ − â)²          n − 2        Σᵢ (yᵢ − b̂xᵢ − â)² / (n − 2)
Total SS      Σᵢ (yᵢ − ȳ)²                n − 1

F-ratio (ratio of FMS/RMS) = (n − 2) Σᵢ (b̂(xᵢ − x̄))² / Σᵢ (yᵢ − b̂xᵢ − â)²
ANOVA in R
Component     SS        Degrees of   Mean Square   F-ratio
                        freedom
Fitting SS    20872.7   1            20872.7       965
Residual SS   605.6     28           21.6
Total SS      21478.3   29

> anova(f)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x          1 20872.7 20872.7     965 < 2.2e-16 ***
Residuals 28   605.6    21.6
> pf(965,1,28,lower.tail=FALSE)
[1] 3.042279e-23
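The same partition can be recovered directly from the fitted model object; a sketch, using f and y from the simulated example above:
fitting.ss  <- sum((fitted(f) - mean(y))^2)   # ~20872.7
residual.ss <- sum(residuals(f)^2)            # ~605.6
total.ss    <- sum((y - mean(y))^2)           # ~21478.3
fitting.ss + residual.ss - total.ss           # zero, up to rounding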
Hypothesis testing
• H0: there is no relationship between y and x (b = 0)
• Assume the errors eᵢ are independent and normally distributed N(0, σ²)
• If H0 is true then the expected values of the sums of squares in the ANOVA are:
  – total SS Σᵢ (yᵢ − ȳ)²:          n − 1 degrees of freedom, expectation (n − 1)σ²
  – fitting SS Σᵢ (b̂(xᵢ − x̄))²:     1 degree of freedom, expectation σ²
  – residual SS Σᵢ (yᵢ − b̂xᵢ − â)²: n − 2 degrees of freedom, expectation (n − 2)σ²
• F ratio = (fitting MS)/(residual MS) ~ 1 under H0
• F >> 1 implies we reject H0
• Under H0, F is distributed as F(1, n − 2)
Degrees of Freedom
• Suppose e₁, …, eₙ are iid N(0,1), i.e. n independent variables
• Then Σᵢ eᵢ² is distributed as χ² on n degrees of freedom (expectation n)
• What about Σᵢ (eᵢ − ē)², where ē is the sample mean of the eᵢ?
• These values are constrained to sum to 0: Σᵢ (eᵢ − ē) = 0
• Therefore the sum is distributed as if it comprised one fewer observation, hence it has n − 1 df (for example, its expectation is n − 1)
• In particular, if p parameters are estimated from a data set, then the residuals have p constraints on them, so they behave like n − p independent variables
The F distribution
• If e₁, …, eₙ are independent and identically distributed (iid) random variables with distribution N(0, σ²), then:
• e₁²/σ², …, eₙ²/σ² are each iid chi-squared random variables on 1 degree of freedom, χ²(1)
• The sum Sₙ = Σᵢ eᵢ²/σ² is distributed as chi-squared χ²(n)
• If Tₘ is a similar sum distributed as chi-squared χ²(m), but independent of Sₙ, then (Sₙ/n)/(Tₘ/m) is distributed as an F random variable, F(n, m)
• Special cases:
  – F(1, m) is the distribution of the square of a t-distributed variable on m df
  – for large m, n·F(n, m) tends to χ²(n)
ANOVA – HDL example
> ff <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)
> ff

Call:
lm(formula = bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)

Coefficients:
                  (Intercept)  bioch$Biochem.Tot.Cholesterol
                       0.2308                         0.4456

> anova(ff)
Analysis of Variance Table

Response: bioch$Biochem.HDL
                                Df  Sum Sq Mean Sq F value    Pr(>F)
bioch$Biochem.Tot.Cholesterol    1 149.660 149.660    1044 < 2.2e-16 ***
Residuals                     1849 265.057   0.143

> pf(1044,1,28,lower.tail=FALSE)
[1] 1.040709e-23
HDL = 0.2308 + 0.4456*Cholesterol
Correlation and ANOVA
• r² = FSS/TSS = fraction of variance explained by the model
• r² = F/(F + n − 2)
  – correlation and ANOVA are equivalent
  – the test of r = 0 is equivalent to the test of b = 0
  – the t statistic in R's cor.test is the square root of the ANOVA F statistic
  – r does not tell us anything about the magnitudes of the estimates of a, b
  – r is dimensionless
  – a numerical check on the simulated data follows
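A quick check of these identities (a sketch, using the simulated x and y from earlier; 965 and 28 are the F statistic and residual df from that ANOVA):
r <- cor(x, y)
r^2                            # = FSS/TSS = 20872.7/21478.3, about 0.972
965 / (965 + 28)               # F/(F + n - 2), the same value
cor.test(x, y)$statistic^2     # t^2, about 965 = the ANOVA F statistic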
Effect of sample size on significance
Total Cholesterol vs HDL data
Example R session to sample subsets of data and compute correlations
seqq <- seq(10, 300, 5)                       # sample sizes to try
corr <- matrix(0, nrow=length(seqq), ncol=2)
colnames(corr) <- c("sample size", "P-value")
n <- 1
for (i in seqq) {
  res <- rep(0, 100)
  for (j in 1:100) {                          # 100 random subsets of size i
    s <- sample(idx, i)                       # idx: row indices of bioch to sample from (not defined on the slide)
    data <- bioch[s, ]
    co <- cor.test(data$Biochem.Tot.Cholesterol, data$Biochem.HDL)
    res[j] <- co$p.value                      # cor.test drops incomplete pairs automatically
  }
  m <- exp(mean(log(res)))                    # geometric mean p-value for this sample size
  cat(i, m, "\n")
  corr[n, ] <- c(i, m)
  n <- n + 1
}
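One way to display the result (a sketch, assuming the loop above has filled corr):
plot(corr[, "sample size"], corr[, "P-value"], log = "y", type = "b",
     xlab = "sample size", ylab = "geometric mean P-value")
abline(h = 0.05, lty = 2)      # conventional 5% significance threshold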
Calculating the right sample size n
• The R library "pwr" contains functions to compute the sample size for many
  problems, including correlation (pwr.r.test()) and one-way ANOVA
  (pwr.anova.test()); a sketch of a call follows.
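A sketch of a call to pwr.r.test(); the target correlation r = 0.26 is just an illustrative value, roughly the HDL-cholesterol correlation seen earlier:
library(pwr)
pwr.r.test(r = 0.26, sig.level = 0.05, power = 0.8)   # returns the required n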
Problems with non-linearity
All plots have r=0.8 (taken from Wikipedia)
Multiple Correlation
• The R cor function can be used to compute pairwise
correlations between many variables at once,
producing a correlation matrix.
• This is useful, for example, when comparing the expression of genes across
  subjects.
• Gene coexpression networks are often based on the
correlation matrix.
• In R:
  mat <- cor(df, use="pairwise.complete.obs")
  – computes the correlation between every pair of columns in df, removing
    missing values in a pairwise manner
  – the output is a square matrix of correlation coefficients
  – a small sketch follows
One-Way ANOVA
• Model y as a function of a categorical variable taking p values
  – e.g. subjects are classified into p families
  – we want to estimate the effect due to each family and test whether these
    effects differ
  – we want to estimate the fraction of variance explained by differences
    between families (an estimate of heritability)
One-Way ANOVA
• Model: yᵢⱼ = μᵢ + eᵢⱼ for observation j in group i
• LS estimators: μ̂ᵢ = ȳᵢ, the average over group i (a sketch in R follows)
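A sketch of this in R, using the HDL and Family variables from the biochemistry data: the fitted value for each family is simply that family's mean HDL.
group.means <- tapply(bioch$Biochem.HDL, bioch$Family, mean, na.rm = TRUE)
head(group.means)    # one fitted value per family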
One-Way ANOVA
• The variance is partitioned into fitting and residual SS:
  – total SS                       n − 1 degrees of freedom
  – fitting SS (between groups)    p − 1 degrees of freedom
  – residual SS (within groups)    n − p degrees of freedom
One-Way ANOVA
Component     SS                      Degrees of   Mean Square
                                      freedom      (ratio of SS to df)
Fitting SS    Σᵢ nᵢ(ȳᵢ − ȳ)²          p − 1        fitting SS / (p − 1)
Residual SS   Σᵢⱼ (yᵢⱼ − ȳᵢ)²         n − p        residual SS / (n − p)
Total SS      Σᵢⱼ (yᵢⱼ − ȳ)²          n − 1

F-ratio (ratio of FMS/RMS) = (fitting MS)/(residual MS)
Under H0 (no differences between groups), F ~ F(p − 1, n − p)
One-Way ANOVA in R
> fam <- lm(bioch$Biochem.HDL ~ bioch$Family)
> anova(fam)
Analysis of Variance Table

Response: bioch$Biochem.HDL
               Df  Sum Sq Mean Sq F value    Pr(>F)
bioch$Family  173  6.3870  0.0369  3.4375 < 2.2e-16 ***
Residuals    1727 18.5478  0.0107
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Component     SS        Degrees of   Mean Square            F-ratio
                        freedom      (ratio of SS to df)    (ratio of FMS/RMS)
Fitting SS    6.3870    173          0.0369                 3.4375
Residual SS   18.5478   1727         0.0107
Total SS      24.9348   1900
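From the table above, the fraction of variance explained by between-family differences (the rough heritability estimate mentioned earlier) can be read off directly:
6.3870 / 24.9348     # FSS / TSS, about 0.26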
Factors in R
• Grouping variables in R are called factors
• When a data frame is read with read.table():
  – a column is treated as numeric if all non-missing entries are numbers
  – a column is treated as logical if all non-missing entries are T or F (or
    TRUE or FALSE)
  – a column is treated as a factor otherwise
  – the levels of the factor are the set of distinct values
  – a column can be forced to be treated as a factor using the function
    as.factor(), or as a numeric vector using as.numeric()
  – BEWARE: if a numeric column contains non-numeric values (e.g. "N" being
    used instead of "NA" for a missing value), then the whole column is
    interpreted as a factor; a sketch of this pitfall follows
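A sketch of the as.factor()/as.numeric() pitfall: as.numeric() applied to a factor returns the internal level codes rather than the original values, so convert via as.character() first.
x <- factor(c("1.2", "3.4", "N", "5.6"))   # a numeric column read in as a factor
as.numeric(x)                  # WRONG: returns the level codes (1 2 4 3 here)
as.numeric(as.character(x))    # RIGHT: 1.2 3.4 NA 5.6, with a coercion warning for "N"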
Linear Modelling in R
• The R function lm() fits linear models
• It has two principal arguments (and some optional ones):
• f <- lm( formula, data )
  – formula is an R formula
  – data is the name of the data frame containing the data (it can be omitted
    if the variables in the formula are written as full data-frame columns,
    e.g. bioch$Biochem.HDL)
Formulae in R
• Biochem.HDL ~ Biochem.Tot.Cholesterol
  – linear regression of HDL on Total Cholesterol
  – 1 df
• Biochem.HDL ~ Family
  – one-way analysis of variance of HDL on Family
  – 173 df
• The degrees of freedom are the number of independent parameters to be
  estimated
ANOVA in R
• Linear regression:
  f <- lm(Biochem.HDL ~ Biochem.Tot.Cholesterol, data=bioch)
  [or f <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)]
  a <- anova(f)
• One-way ANOVA:
  f <- lm(Biochem.HDL ~ Family, data=bioch)
  a <- anova(f)
Non-parametric Methods
• So far we have assumed the errors in the data are Normally distributed
• P-values and estimates can be inaccurate if this is not the case
• Non-parametric methods are a (partial) way round this problem
• They make fewer assumptions about the distribution of the data; the
  observations are only assumed to be:
  – independent
  – identically distributed
Non-Parametric Correlation
Spearman Rank Correlation Coefficient
• Replace the observations by their ranks
• e.g. x = (5, 1, 4, 7) -> rank(x) = (3, 1, 2, 4)
• Compute the sum of squared differences between the ranks of x and y
• In R:
  – cor(x, y, method="spearman")
  – cor.test(x, y, method="spearman")
  – a sketch follows
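Equivalently, Spearman's rho is just the Pearson correlation of the ranks; a sketch with made-up numbers (the y values are arbitrary illustrations):
x <- c(5, 1, 4, 7)
y <- c(2.1, 0.4, 1.8, 3.0)
cor(x, y, method = "spearman")
cor(rank(x), rank(y))           # identical value when there are no ties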
Spearman Correlation
> cor.test(xx, y, method="pearson")
Pearson's product-moment correlation
data: xx and y
t = 0.0221, df = 28, p-value = 0.9825
alternative hypothesis: true correlation is not
equal to 0
95 percent confidence interval:
-0.3566213 0.3639062
sample estimates:
cor
0.004185729
> cor.test(xx,y,method="spearman")
Spearman's rank correlation rho
data: xx and y
S = 2473.775, p-value = 0.01267
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.4496607
Non-Parametric
One-Way ANOVA
• Kruskal-Wallis test
• Useful if the data are highly non-Normal
  – replace the data by their ranks
  – compute the average rank within each group
  – compare the averages
  – in R: kruskal.test( formula, data ); see the sketch below
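For example, applied to the HDL/family data used earlier (a sketch of the call):
kruskal.test(bioch$Biochem.HDL ~ bioch$Family)
# or equivalently: kruskal.test(Biochem.HDL ~ Family, data = bioch)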
Permutation Tests
as non-parametric tests
• Example: one-way ANOVA
  – permute the group identity between subjects
  – count the fraction of permutations in which the ANOVA p-value is smaller
    than the p-value for the unpermuted data
a <- anova(lm(bioch$Biochem.HDL ~ bioch$Family))
p <- a[1,5]                            # observed p-value, column Pr(>F)
pv <- rep(0, 1000)
for (i in 1:1000) {
  perm <- sample(bioch$Family, replace=FALSE)   # permute family labels
  a <- anova(lm(bioch$Biochem.HDL ~ perm))
  pv[i] <- a[1,5]
}
pval <- mean(pv < p)                   # permutation p-value