
Linear Modelling I
Richard Mott, Wellcome Trust Centre for Human Genetics

Synopsis
• Linear Regression
• Correlation
• Analysis of Variance
• Principle of Least Squares

Correlation and linear regression
• Is there a relationship?
• How do we summarise it?
• Can we predict new observations?
• What about outliers?

The Correlation Coefficient r
• −1 ≤ r ≤ 1
• r = 1: perfect positive linear relationship
• r = −1: perfect negative linear relationship
• r = 0: no linear relationship

[Figure: example scatterplots for various correlations, e.g. r = 0.6; taken from Wikipedia]

Calculation of r
• For data (x1, y1), ..., (xn, yn):

  r = Σi (xi − x̄)(yi − ȳ) / √( Σi (xi − x̄)² · Σi (yi − ȳ)² )

Correlation in R

```
> cor(bioch$Biochem.Tot.Cholesterol, bioch$Biochem.HDL, use="complete")
[1] 0.2577617
> cor.test(bioch$Biochem.Tot.Cholesterol, bioch$Biochem.HDL, use="complete")

        Pearson's product-moment correlation

data:  bioch$Biochem.Tot.Cholesterol and bioch$Biochem.HDL
t = 11.1473, df = 1746, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2134566 0.3010088
sample estimates:
      cor
0.2577617

> pt(11.1473, df=1746, lower.tail=FALSE)   # T distribution on 1746 degrees of freedom
[1] 3.154319e-28
```

Linear Regression
Fit a straight line, y = a + b·x + e, to the data:
• a: intercept
• b: slope
• e_i: error, Normally distributed with E(e_i) = 0 and Var(e_i) = σ²

Example: simulated data

```
> # simulate 30 data points
> x <- 1:30
> e <- rnorm(30, 0, 5)
> y <- 1 + 3*x + e
> # fit the linear model
> f <- lm(y ~ x)
> # plot the data and the predicted line
> plot(x, y)
> abline(reg=f)
> print(f)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   -0.08634      3.04747
```

Least Squares
• Estimate a and b by least squares
• Minimise the sum of squared residuals between y and the prediction a + b·x:

  minimise Σi (yi − a − b·xi)²
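The minimisation above has a closed-form solution, which lm() implements. A minimal sketch on simulated data (the same simulation scheme as the lecture example, not the bioch dataset), checking the textbook formulae against lm():

```r
# Least-squares estimates in closed form:
#   b.hat = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   a.hat = mean(y) - b.hat * mean(x)
set.seed(1)
x <- 1:30
y <- 1 + 3 * x + rnorm(30, 0, 5)    # simulated data, true a = 1, b = 3

b.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a.hat <- mean(y) - b.hat * mean(x)

f <- lm(y ~ x)
# the hand-computed estimates match lm()'s coefficients
all.equal(unname(coef(f)), c(a.hat, b.hat))
```

Any other straight line has a larger residual sum of squares than (a.hat, b.hat), which is what "least squares" means.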
Why least squares?
• LS gives simple formulae for the estimates of a and b
• If the errors are Normally distributed then the LS estimates are "optimal":
  – in large samples the estimates converge to the true values
  – no other estimates have smaller expected errors
  – LS = maximum likelihood
• Even if the errors are not Normal, LS estimates are often useful

Analysis of Variance (ANOVA)
LS estimates have an important property: they partition the sum of squares (SS) into fitted and error components:

  Σi (yi − ȳ)² = Σi (b̂(xi − x̄))² + Σi (yi − b̂xi − â)²

• total SS = fitting SS + residual SS
• only the LS estimates do this

  Component     SS                       Degrees of freedom   Mean square (SS/df)
  Fitting SS    Σi (b̂(xi − x̄))²        1                    Σi (b̂(xi − x̄))²
  Residual SS   Σi (yi − b̂xi − â)²      n−2                  Σi (yi − b̂xi − â)² / (n−2)
  Total SS      Σi (yi − ȳ)²            n−1

  F-ratio (FMS/RMS) = (n−2) · Σi (b̂(xi − x̄))² / Σi (yi − b̂xi − â)²

ANOVA in R

  Component     SS        Degrees of freedom   Mean square   F-ratio
  Fitting SS    20872.7   1                    20872.7       965
  Residual SS   605.6     28                   21.6
  Total SS      21478.3   29

```
> anova(f)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x          1 20872.7 20872.7     965 < 2.2e-16 ***
Residuals 28   605.6    21.6
> pf(965, 1, 28, lower.tail=FALSE)
[1] 3.042279e-23
```

Hypothesis testing
• H0: b = 0, i.e. no relationship between y and x
• Assume the errors e_i are independent and Normally distributed N(0, σ²)
• If H0 is true then the expected values of the sums of squares in the ANOVA are:

  Total SS     Σi (yi − ȳ)²:         df n−1,  expectation (n−1)σ²
  Fitting SS   Σi (b̂(xi − x̄))²:     df 1,    expectation σ²
  Residual SS  Σi (yi − b̂xi − â)²:   df n−2,  expectation (n−2)σ²

• F-ratio = (fitting MS)/(residual MS) ~ 1 under H0
• F >> 1 implies we reject H0
• Under H0, F is distributed as F(1, n−2)

Degrees of Freedom
• Suppose y1, ..., yn are iid N(μ, σ²); then Σi (yi − μ)²/σ² is a sum of n independent χ²(1) variables and has n degrees of freedom (its expectation is n)
• What about Σi (yi − ȳ)²/σ², where μ is replaced by its estimate ȳ?
• The values yi − ȳ are constrained to sum to 0
• Therefore the sum is distributed as if it comprised one fewer observation, hence it has n−1 df (for example, its expectation is n−1)
• In general, if p parameters are estimated from a data set, the n standardised residuals have p constraints on them, so they behave like n−p independent N(0,1) variables

The F distribution
• If e1, ..., en are independent and identically distributed (iid) random variables with distribution N(0, σ²), then:
  – e1²/σ², ..., en²/σ² are each iid chi-squared random variables on 1 degree of freedom, χ²(1)
  – the sum Sn = Σi ei²/σ² is distributed as chi-squared χ²(n)
  – if Tm is a similar sum distributed as χ²(m), but independent of Sn, then (Sn/n)/(Tm/m) is distributed as an F random variable, F(n, m)
• Special cases:
  – F(1, m) is the square of a T-distributed variable on m df
  – for large m, n·F(n, m) tends to χ²(n)

ANOVA – HDL example

```
> ff <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)
> ff

Call:
lm(formula = bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)

Coefficients:
                  (Intercept)  bioch$Biochem.Tot.Cholesterol
                       0.2308                         0.4456

> anova(ff)
Analysis of Variance Table

Response: bioch$Biochem.HDL
                                Df  Sum Sq Mean Sq F value Pr(>F)
bioch$Biochem.Tot.Cholesterol    1 149.660 149.660    1044
Residuals                     1849 265.057   0.143
> pf(1044, 1, 1849, lower.tail=FALSE)
```

HDL = 0.2308 + 0.4456 × Cholesterol

Correlation and ANOVA
• r² = FSS/TSS = fraction of variance explained by the model
• r² = F/(F + n − 2)
• Correlation and ANOVA are equivalent:
  – the test of r = 0 is equivalent to the test of b = 0
  – the T statistic in R's cor.test is the square root of the ANOVA F statistic
• r tells us nothing about the magnitudes of the estimates of a and b
• r is dimensionless

Effect of sample size on significance
[Figure: Total Cholesterol vs HDL data]
Example R session to sample subsets of the data and compute correlations:

```
seqq <- seq(10, 300, 5)
corr <- matrix(0, nrow=length(seqq), ncol=2)
colnames(corr) <- c("sample size", "P-value")
n <- 1
for (i in seqq) {
  res <- rep(0, 100)
  for (j in 1:100) {
    s <- sample(idx, i)            # idx: row indices of bioch, defined elsewhere
    data <- bioch[s,]
    co <- cor.test(data$Biochem.Tot.Cholesterol, data$Biochem.HDL)
    res[j] <- co$p.value
  }
  m <- exp(mean(log(res)))         # geometric mean P-value
  cat(i, m, "\n")
  corr[n,] <- c(i, m)
  n <- n + 1
}
```

Calculating the right sample size n
• The R library "pwr" contains functions to compute the sample size for many problems, including correlation (pwr.r.test()) and ANOVA (pwr.anova.test())

Problems with non-linearity
[Figure: scatterplots of very different shapes, all with r = 0.8; taken from Wikipedia]

Multiple Correlation
• The R cor() function can be used to compute pairwise correlations between many variables at once, producing a correlation matrix
• This is useful, for example, when comparing expression of genes across subjects
• Gene coexpression networks are often based on the correlation matrix
• In R:

  mat <- cor(df, use="pairwise.complete.obs")

  – computes the correlation between every pair of columns in df, removing missing values in a pairwise manner
  – the output is a square matrix of correlation coefficients

One-Way ANOVA
• Model y as a function of a categorical variable taking p values
  – e.g. subjects are classified into p families
  – we want to estimate the effect due to each family and test whether these effects differ
  – we want to estimate the fraction of variance explained by differences between families (an estimate of heritability)

One-Way ANOVA: LS estimates
• Writing y_ij for observation j in group i, the LS estimate of the effect of group i is the average of the observations in group i

One-Way ANOVA: partitioning the variance
• The variance is partitioned into fitting and residual SS:
  – total SS (n−1 df) = fitting SS between groups (p−1 df) + residual SS within groups (n−p df)

  Component     Degrees of freedom   Mean square (SS/df)   F-ratio (FMS/RMS)
  Fitting SS    p−1
  Residual SS   n−p
  Total SS      n−1

• Under H0 (no differences between groups), F ~ F(p−1, n−p)

One-Way ANOVA in R

```
> fam <- lm(bioch$Biochem.HDL ~ bioch$Family)
> anova(fam)
Analysis of Variance Table

Response: bioch$Biochem.HDL
               Df  Sum Sq Mean Sq F value    Pr(>F)
bioch$Family  173  6.3870  0.0369  3.4375 < 2.2e-16 ***
Residuals    1727 18.5478  0.0107
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
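The sum-of-squares partition in the one-way ANOVA table can be checked numerically. A small sketch on simulated data (the five groups below are invented for illustration, not taken from the bioch data):

```r
# One-way ANOVA on simulated data: verify total SS = fitting SS + residual SS
set.seed(4)
group <- factor(rep(letters[1:5], each = 8))   # p = 5 groups, n = 40
y <- rnorm(40, mean = as.numeric(group))       # group means differ

a <- anova(lm(y ~ group))
fitting.ss  <- a["group", "Sum Sq"]
residual.ss <- a["Residuals", "Sum Sq"]
total.ss    <- sum((y - mean(y))^2)

# the partition holds exactly (up to floating-point error)
all.equal(fitting.ss + residual.ss, total.ss)

# degrees of freedom: p - 1 = 4 for fitting, n - p = 35 for residual
a["group", "Df"]
a["Residuals", "Df"]
```

Only the least-squares (group-mean) estimates give this exact decomposition, which is why the F-ratio built from these two components is the natural test statistic.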

  Component     SS        Degrees of freedom   Mean square   F-ratio
  Fitting SS    6.3870    173                  0.0369        3.4375
  Residual SS   18.5478   1727                 0.0107
  Total SS      24.9348   1900

Factors in R
• Grouping variables in R are called factors
• When a data frame is read with read.table():
  – a column is treated as numeric if all non-missing entries are numbers
  – a column is boolean if all non-missing entries are T or F (or TRUE or FALSE)
  – a column is treated as a factor otherwise; the levels of the factor are the set of distinct values
  – a column can be forced to be treated as a factor using as.factor(), or as a numeric vector using as.numeric()
  – BEWARE: if a numeric column contains non-numeric values (e.g. "N" used instead of "NA" for a missing value), the column is interpreted as a factor

Linear Modelling in R
• The R function lm() fits linear models
• It has two principal arguments (and some optional ones): f <- lm(formula, data)
  – formula is an R formula
  – data is the name of the data frame containing the data (it can be omitted if the variables in the formula name the data frame explicitly)

Formulae in R
• Biochem.HDL ~ Biochem.Tot.Cholesterol
  – linear regression of HDL on Cholesterol; 1 df
• Biochem.HDL ~ Family
  – one-way analysis of variance of HDL on Family; 173 df
• The degrees of freedom are the number of independent parameters to be estimated

ANOVA in R
• f <- lm(Biochem.HDL ~ Biochem.Tot.Cholesterol, data=bioch)
  [or f <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)]
• a <- anova(f)
• f <- lm(Biochem.HDL ~ Family, data=bioch)
• a <- anova(f)

Non-parametric Methods
• So far we have assumed the errors in the data are Normally distributed
• P-values and estimates can be inaccurate if this is not the case
• Non-parametric methods are a (partial) way round this problem
• They make fewer assumptions about the distribution of the data, requiring only that observations are:
  – independent
  – identically distributed
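To see why weaker distributional assumptions can matter, here is a small sketch (invented data, not from the lecture) in which one extreme point dominates the ordinary Pearson correlation, while a rank-based version is far less affected:

```r
# A single outlier can dominate the Pearson correlation;
# replacing the data by their ranks removes most of its leverage
set.seed(5)
x <- c(rnorm(20), 10)    # 20 unrelated points plus one extreme point
y <- c(rnorm(20), 10)    # the extreme point is concordant in x and y

r.pearson <- cor(x, y)               # inflated by the outlier
r.ranks   <- cor(rank(x), rank(y))   # much less affected
r.pearson > r.ranks
```

The rank-based statistic treats the outlier as just the largest observation, so the 20 genuinely unrelated points dominate it.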
Non-Parametric Correlation: the Spearman Rank Correlation Coefficient
• Replace the observations by their ranks
  – e.g. x = (5, 1, 4, 7) → rank(x) = (3, 1, 2, 4)
• Compute the correlation from the sum of squared differences between the ranks
• In R:
  – cor(x, y, method="spearman")
  – cor.test(x, y, method="spearman")

Spearman correlation example

```
> cor.test(xx, y, method="pearson")

        Pearson's product-moment correlation

data:  xx and y
t = 0.0221, df = 28, p-value = 0.9825
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3566213  0.3639062
sample estimates:
        cor
0.004185729

> cor.test(xx, y, method="spearman")

        Spearman's rank correlation rho

data:  xx and y
S = 2473.775, p-value = 0.01267
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.4496607
```

Non-Parametric One-Way ANOVA
• The Kruskal-Wallis test
• Useful if the data are highly non-Normal:
  – replace the data by their ranks
  – compute the average rank within each group
  – compare the averages
• In R: kruskal.test(formula, data)

Permutation tests as non-parametric tests
• Example, for one-way ANOVA:
  – permute the group identities between subjects
  – count the fraction of permutations in which the ANOVA p-value is at least as small as the true p-value

```
a <- anova(lm(bioch$Biochem.HDL ~ bioch$Family))
p <- a[1,5]
pv <- rep(0, 1000)
for (i in 1:1000) {
  perm <- sample(bioch$Family, replace=FALSE)
  a <- anova(lm(bioch$Biochem.HDL ~ perm))
  pv[i] <- a[1,5]
}
pval <- mean(pv <= p)   # permutation P-value
```
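The same permutation idea, rewritten as a self-contained sketch on simulated data (the group labels and effect sizes below are invented for illustration, not from the bioch data):

```r
# Permutation test for a one-way ANOVA: shuffling the group labels
# destroys any group/response association, so the shuffled p-values
# sample the null distribution
set.seed(6)
group <- factor(rep(1:4, each = 10))
y <- rnorm(40) + as.numeric(group)    # a real group effect

obs.p <- anova(lm(y ~ group))[1, 5]   # p-value on the observed labels
perm.p <- replicate(1000, {
  g <- sample(group)                  # permute group identities
  anova(lm(y ~ g))[1, 5]
})
pval <- mean(perm.p <= obs.p)         # permutation P-value
```

Because the permutation null distribution is built from the data themselves, this test does not rely on the Normality of the errors, only on the exchangeability of observations between groups under H0.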