Canadian Bioinformatics Workshops
www.bioinformatics.ca

Lecture 4
Multivariate Analyses I: Specific Models
MBP1010
Dr. Paul C. Boutros
Winter 2015
DEPARTMENT OF MEDICAL BIOPHYSICS

† Title image: Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)

This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others.

Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Microarray Analysis I: Pre-Processing
• Lecture 7: Microarray Analysis II: Multiple-Testing
• Lecture 8: Data Visualization & Machine-Learning
• Lecture 9: Sequence Analysis Basics
• Final Exam (written)

How Will You Be Graded?
• 9% Participation: 1% per week
• 56% Assignments: 8 x 7% each
• 35% Final Examination: in-class
• Each individual will get their own, unique assignment
• Assignments will all be in R, and will be graded on computational correctness only (i.e. does your R script yield the correct result when run)
• The final exam will include multiple-choice and written answers

House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions

Topics For This Week
• Review to date
• Cervix cancer genomes
• Attendance
• Multivariate analyses

Review From Lecture #1
Population vs. sample?
All MBP students = population; MBP students in 1010 = sample.
How do you report statistical information?
P-value, variance, effect-size, sample-size, and the test used.
Why don't we use Excel/spreadsheets?
Input errors, poor reproducibility, wrong results.

Review From Lecture #2
Define discrete data.
Data with gaps on the number-line.
What is the central limit theorem?
A random variable that is the sum of many small random variables is normally distributed.
Theoretical vs. empirical quantiles?
A theoretical quantile is defined by a probability under a distribution; an empirical quantile by the percentage of observed values below it.
Components of a boxplot?
25% - 1.5 IQR, 25%, 50%, 75%, 75% + 1.5 IQR.

Review From Lecture #2
How can you interpret a QQ plot?
It compares two samples, or a sample and a distribution; a straight line indicates identity.
What is hypothesis testing?
Confirmatory data analysis; testing a null hypothesis.
What is a p-value?
Evidence against the null; the probability of a false positive; the probability of seeing as extreme a value by chance alone.

Review From Lecture #2
Parametric vs. non-parametric tests?
Parametric tests have distributional assumptions.
What is the t-statistic?
A signal:noise ratio.
Assumptions of the t-test?
Data sampled from a normal distribution; independence of replicates; independence of groups; homoscedasticity.

Flow-Chart For Two-Sample Tests
• Is the data sampled from a normally-distributed population, or is n sufficient for the CLT (>30)?
  • No → Wilcoxon U-test
  • Yes → equal variance (F-test)?
    • Yes → homoscedastic t-test
    • No → heteroscedastic t-test
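As a sketch, this decision flow maps onto base-R tests. The helper name choose.test() and the 0.05 cutoffs below are our illustrative choices, not part of the slides:

# Illustrative sketch of the two-sample flow-chart; the helper name and
# the 0.05 cutoffs are assumptions, not from the lecture
choose.test <- function(x, y) {
    normal <- shapiro.test(x)$p.value > 0.05 && shapiro.test(y)$p.value > 0.05;
    large.n <- min(length(x), length(y)) > 30;
    if (normal || large.n) {
        # F-test for equal variances decides which t-test to run
        equal.var <- var.test(x, y)$p.value > 0.05;
        return( t.test(x, y, var.equal = equal.var) );
    }
    # non-parametric fallback
    return( wilcox.test(x, y) );
}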
Review From Lecture #3
What is statistical power?
The probability that a test will correctly reject a false null hypothesis; also known as sensitivity, or 1 - the false-negative rate.
What is a correlation?
A relationship between two (random) variables.
Common correlation metrics?
Pearson, Spearman, Kendall.

Review From Lecture #3
What is a ceRNA?
An endogenous RNA that "soaks up" miRNAs to prevent their activity on another endogenous RNA.
What are the major univariate discrete tests?
Pearson's chi-squared, Fisher's exact, proportion, hypergeometric.

Four Main Discrete Univariate Tests
• Hypergeometric test
  • Is a sample randomly selected from a fixed population?
• Proportion test
  • Are two proportions equivalent?
• Fisher's Exact test
  • Are two binary classifications associated?
• (Pearson's) Chi-Squared test
  • Are paired observations on two variables independent?
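All four map onto base-R functions. A minimal sketch with made-up counts (the numbers are purely illustrative):

# Illustrative calls for the four discrete tests (all counts made up)
counts <- matrix(c(12, 5, 9, 24), nrow = 2);  # a 2x2 contingency table
fisher.test(counts);             # association between two binary classifications
chisq.test(counts);              # independence of paired observations
prop.test(c(12, 9), c(17, 33));  # equivalence of two proportions
# hypergeometric: P(X <= 3) successes when drawing 10 from 20 white + 15 black
phyper(3, m = 20, n = 15, k = 10);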
When Do We Use Statistics?
• Statistics is ubiquitous in modern biology
• Every class I will show a use of statistics in a (very, very) recent Nature paper.

Cervix Cancer 101
• Disease burden is increasing (~380k to ~450k in the last 30 years)
• By age 50, >80% of women have had an HPV infection
• >75% of sexually active women are exposed, but only a subset is affected
• Why this is so is nearly totally unknown!
• Tightly associated with poverty

HPV Infection Associated Cancers
Cervix       >99%
Anal         ~85%
Vaginal      ~70%
Vulvar       ~40%
Penile       ~45%
Head & Neck  ~20-30%
Of course not all of these are caused by the HPV subtypes caught by current vaccines, but a majority are. Thus many cancers are preventable.

Figure 1 is a Classic Sequencing Figure
[Figure 1 of the paper shown in class]

What Statistical Analysis Did They Do?
• Lots of information in the Supplementary material, though in large part citations to previous work
• In the main text, mutation rate vs. histology was compared using a Wilcoxon test
• They reported the p-value, sample-size and effect-size
• Their reporting in the supplementary material was incredibly good. For example...

[Two slides of supplementary reporting shown in class]

Attendance Break

Problem
When we measure more than one variable for each member of a population, a scatter plot may show us that the values are not completely independent: there is, for example, a trend for one variable to increase as the other increases. Regression analyzes this dependence.
Examples:
• Height vs. weight
• Gene dosage vs. expression level
• Survival analysis: probability of death vs. age

Correlation
When one variable depends on the other, the variables are to some degree correlated. (Note: correlation need not imply causality.)
In R, the function cov() measures covariance and cor() measures the Pearson coefficient of correlation (a normalized measure of covariance). Pearson's coefficient of correlation ranges from -1 to 1, with 0 indicating no correlation.

Modeling
Specify model → estimate parameters → adequate? If no, respecify the model; if yes, use the model.
Linear regression is one possible model we can apply to data analysis. A model in the statistician's sense might not be what you think: it is merely a device to explain data. While it may help you consider mechanisms and causalities, it is not necessarily a representation of any particular physical or biological mechanism. Remember: correlation does not entail causation.

Types of regression
Linear regression assumes a particular model:
    yi = α + β·xi + εi
xi is the independent variable. Depending on the context, it is also known as a "predictor variable," "regressor," "controlled variable," "manipulated variable," "explanatory variable," "exposure variable," and/or "input variable."
yi is the dependent variable, also known as the "response variable," "regressand," "measured variable," "observed variable," "responding variable," "explained variable," "outcome variable," "experimental variable," and/or "output variable."
The εi are "errors" - not in the sense of being "wrong", but in the sense of creating deviations from the idealized model. The εi are assumed to be independent and normally distributed, N(0, σ²); they can also be called residuals.
This model has two parameters: the regression coefficient β and the intercept α.

Linear regression
Assumptions:
• Only two variables are of interest
• One variable is a response and one a predictor
• No adjustment is needed for confounding or other between-subject variation
• Linearity
• σ² is constant, independent of x
• The εi are independent of each other
• For proper statistical inference (CI, p-values), the εi are normally distributed

Linear regression analysis includes:
• estimation of the parameters;
• characterization of how good the model is.

Linear regression: estimation
Parameter estimation: choose parameters that come as close as possible to the "true" values.
Problem: how do we distinguish "good" from "poor" estimates?
One possibility: minimize the Sum of Squared Errors (SSE). In a general sense, for a sample S = {(x1,y1), (x2,y2), ..., (xn,yn)} and a model M:
    SSE = Σ_{i=1..n} ( yi - M(xi) )²
For a linear model with estimated parameters a, b:
    SSE = Σ_{i=1..n} ( yi - a - b·xi )²
Estimation: choose the parameters a, b so that the SSE is as small as possible. We call these least squares estimates. The method of least squares has an analytic solution for the linear case.
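A minimal sketch of these quantities in R; the helper names sse(), b.hat() and a.hat() are ours, for illustration, using the textbook closed-form solution:

# Sum of squared errors for candidate parameters a, b (illustrative helper)
sse <- function(a, b, x, y) {
    sum( (y - a - b * x)^2 );
}
# the analytic least-squares estimates for the simple linear model
b.hat <- function(x, y) { cov(x, y) / var(x); }
a.hat <- function(x, y) { mean(y) - b.hat(x, y) * mean(x); }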
Linear regression example: height vs. weight

synthHWsamples <- function(n) {
    set.seed(83)
    # parameters for a height vs. weight plot
    hmin <- 1.5
    hmax <- 2.3
    M <- matrix(nrow=n, ncol=2)
    # generate a column of heights in the interval [hmin, hmax]
    M[,1] <- runif(n, hmin, hmax)
    # generate a column of weights with a linear model
    M[,2] <- 40 * M[,1] + 1
    # add some errors
    M[,2] <- M[,2] + rnorm(n, 0, 15)
    return(M)
}

Under the parameters used above, a linear regression analysis should show a slope of 40 kg/m and an intercept of 1 kg. It is always a good idea to sanity-test your analysis with synthetic data. After all: if you can't retrieve your model parameters from synthetic data, how could you trust your analysis of real data?

> HW <- synthHWsamples(50)
> plot(HW, xlab="Height (m)", ylab="Weight (kg)")
> cov(HW[,1], HW[,2])
[1] 2.498929
> cor(HW[,1], HW[,2])
[1] 0.5408063

Pearson's Coefficient of Correlation
How to interpret the correlation coefficient: explore varying degrees of randomness...
> x <- rnorm(50)
> r <- 0.99
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.9999666

Varying degrees of randomness...
> x <- rnorm(50)
> r <- 0.8
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.9661111
(Note that r here controls the signal:noise mixture; it is not itself the resulting correlation coefficient.)

Linear regression example: height vs. weight
Estimate a linear model to recover the parameters:
> lm(HW[,2] ~ HW[,1])

Call:
lm(formula = HW[, 2] ~ HW[, 1])

Coefficients:
(Intercept)      HW[, 1]
      -2.86        42.09

> abline(-2.86, 42.09)
or:
> abline(lm(HW[,2] ~ HW[,1]), col=rgb(192/255, 80/255, 77/255), lwd=3)

Extract information:
> summary(lm(HW[,2] ~ HW[,1]))

Call:
lm(formula = HW[, 2] ~ HW[, 1])

Residuals:
    Min      1Q  Median      3Q     Max
-36.490 -10.297   3.426   9.156  37.385

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -2.860     18.304  -0.156    0.876
HW[, 1]       42.090      9.449   4.454 5.02e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.12 on 48 degrees of freedom
Multiple R-squared: 0.2925, Adjusted R-squared: 0.2777
F-statistic: 19.84 on 1 and 48 DF, p-value: 5.022e-05

In class we walked through this output piece by piece: the five-number summary of the residuals, the coefficient estimates with their standard errors and t-tests, and the overall R-squared and F-statistic.
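Rather than reading numbers off the printed summary, you can extract them programmatically. A small sketch (the object name fit.lm is ours):

# extract estimates and confidence limits from the fit
fit.lm <- lm(HW[,2] ~ HW[,1])
coef(fit.lm)                # intercept and slope estimates
confint(fit.lm)             # 95% confidence intervals for the parameters
summary(fit.lm)$r.squared   # fraction of variance explained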
Linear regression: quality control
Interpreting the results has two parts:
1: Is the model adequate? (Residuals)
2: Are the parameter estimates good? (Confidence limits)

Linear regression: residuals
The solid red line is the least-squares-fit line (regression line), defined by a particular slope and intercept. The red lines between the regression line and the actual data points are the residuals. Residuals are "signed", i.e. negative if an observation is smaller than the corresponding value of the regression line.

Linear regression: quality control
Residual plots allow us to validate the underlying assumptions:
• The relationship between response and regressor should be linear (at least approximately).
• The error term should have zero mean.
• The error term should have constant variance.
• The errors should be normally distributed (required for tests and intervals).
[Figure: residual patterns indicating adequate and inadequate fits. Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis]
Check constant variance and linearity, and look for potential outliers. How does our synthetic data look in this regard?

Linear regression example: height vs. weight
Get residuals:
res <- resid(lm(HW[,2] ~ HW[,1]))
Get idealized values:
fit <- fitted(lm(HW[,2] ~ HW[,1]))
Plot the differences:
segments(HW[,1], HW[,2], HW[,1], fit, col=2)

fit vs. residuals:
> plot(fit, res)
> cor(fit, res)
[1] -1.09228e-16
(Essentially zero: least squares guarantees that the residuals are uncorrelated with the fitted values.)

Linear regression: Q-Q plot
Plotting the residuals against similarly distributed normal deviates checks the normality assumption: the adequate pattern is a straight line; curved or S-shaped patterns are inadequate.
[Figure: one adequate and four inadequate Q-Q patterns of residuals vs. the cumulative probability of the normal distribution. Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis]
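Worth knowing, though not on the slides: base R bundles these diagnostic checks; calling plot() on an lm object draws them directly.

# base R draws the standard diagnostics for an lm fit in one call
fit.lm <- lm(HW[,2] ~ HW[,1])
par(mfrow=c(2,2))   # 2x2 panel layout
plot(fit.lm)        # residuals vs. fitted, normal Q-Q, scale-location, leverage
par(mfrow=c(1,1))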
Linear regression example: height vs. weight
Q-Q plot: are the residuals normally distributed?
qqnorm(res)

Linear regression: evaluating accuracy
If the model is valid, i.e. there is nothing terrible in the residuals, we can use it to predict. But how good is the prediction?

Linear regression example: height vs. weight
Prediction and confidence limits:
> pp <- predict(lm(HW[,2] ~ HW[,1]), int="p")
Warning message:
In predict.lm(lm(HW[, 2] ~ HW[, 1]), int = "p") :
  Predictions on current data refer to _future_ responses
> pc <- predict(lm(HW[,2] ~ HW[,1]), int="c")
> head(pc)
       fit      lwr      upr
1 60.57098 51.45048 69.69148
2 67.98277 61.53194 74.43360
3 77.96070 73.37784 82.54356
4 92.04435 84.23698 99.85171
5 76.34929 71.70340 80.99518
6 76.57656 71.94643 81.20670

Plot the pp and pc limits:
a: sort on x
o <- order(HW[,1])
HW2 <- HW[o,]
b: recompute pp, pc
pc <- predict(lm(HW2[,2] ~ HW2[,1]), int="c")
pp <- predict(lm(HW2[,2] ~ HW2[,1]), int="p")
c: plot
> plot(HW2, xlab="Height (m)", ylab="Weight (kg)", ylim=range(HW2[,2], pp))
> matlines(HW2[,1], pc, lty=c(1,2,2), col="black")
> matlines(HW2[,1], pp, lty=c(1,3,3), col="red")
The black dashed lines are the confidence interval (at p=0.95) for the parameters; the red dotted lines are the prediction interval (at p=0.95) for future values.

Lots of Analyses Are Linear Regressions
Y = a0 + a1·x1, x1 continuous → linear regression
Y = a0 + a1·x1, Y factorial   → logistic regression (see the sketch at the end of these notes)
Y = a0 + a1·x1, x1 factorial  → 1-way ANOVA

One-Way ANOVAs in R
# have a list of groups
x <- as.factor(rep(c('A', 'B', 'C'), 3));
# and some continuous data
y <- rnorm(9);
# fit a one-way anova with:
tmp <- aov(y ~ x);
# get a p-value with:
summary(tmp);
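The slides cover the ANOVA row of the table with aov(). For the logistic-regression row, a minimal sketch on simulated data (the coefficients 0.5 and 2 and the object names are our illustrative choices):

# logistic regression: the same linear form with a factorial (binary) Y
x <- rnorm(100);
p <- 1 / (1 + exp(-(0.5 + 2 * x)));      # inverse-logit of a0 + a1*x1
y <- rbinom(100, size=1, prob=p);        # binary response
fit.glm <- glm(y ~ x, family=binomial);  # logistic regression via glm()
summary(fit.glm);                        # the slope estimate is a log-odds ratio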