bioinformatics - Department of Medical Biophysics

advertisement
Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Module #: Title of Module
2
Lecture 4
Multivariate Analyses I: Specific Models
MBP1010
†
Dr. Paul C. Boutros
Winter 2015
DEPARTMENT OF
MEDICAL BIOPHYSICS
†
Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
This workshop includes material
originally developed by Drs. Raphael Gottardo,
Sohrab Shah, Boris Steipe and others
Course Overview
•
•
•
•
•
•
•
•
•
•
Lecture 1: What is Statistics? Introduction to R
Lecture 2: Univariate Analyses I: continuous
Lecture 3: Univariate Analyses II: discrete
Lecture 4: Multivariate Analyses I: specialized models
Lecture 5: Multivariate Analyses II: general models
Lecture 6: Microarray Analysis I: Pre-Processing
Lecture 7: Microarray Analysis II: Multiple-Testing
Lecture 8: Data Visualization & Machine-Learning
Lecture 9: Sequence Analysis Basics
Final Exam (written)
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
How Will You Be Graded?
• 9% Participation: 1% per week
• 56% Assignments: 8 x 7% each
• 35% Final Examination: in-class
• Each individual will get their own, unique assignment
• Assignments will all be in R, and will be graded according
to computational correctness only (i.e. does your R script
yield the correct result when run)
• Final Exam will include multiple-choice and written
answers
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Topics For This Week
• Review to date
• Cervix Cancer Genomes
• Attendance
• Multivariate analyses
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Review From Lecture #1
Population vs. Sample
All MBP Students = Population
MBP Students in 1010 = Sample
How do you report statistical information?
P-value, variance, effect-size, sample-size, test
Why don’t we use Excel/spreadsheets?
Input errors, reproducibility, wrong results
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Review From Lecture #2
Define discrete data
Has gaps on the number-line
What is the central limit theorem?
A random variable that is the sum of many
small random variables is normally distributed
Theoretical vs. empirical quantiles
Probability vs. percentage of values less than p
Components of a boxplot?
25% - 1.5 IQR, 25%, 50%, 75%, 75% + 1.5 IQR
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Review From Lecture #2
How can you interpret a QQ plot?
Compares two samples or a sample and a
distribution. Straight line indicates identity.
What is hypothesis testing?
Confirmatory data-analysis; test null hypothesis
What is a p-value?
Evidence against null; probability of FP,
probability of seeing as extreme a value by
chance alone
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Review From Lecture #2
Parametric vs. non-parametric tests
Parametric tests have distributional assumptions
What is the t-statistics?
Signal:Noise ratio
Assumptions of the t-test?
Data sampled from normal distribution;
independence of replicates; independence of
groups; homoscedasticity
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Flow-Chart For Two-Sample Tests
Is Data Sampled From a
Normally-Distributed Population?
Yes
No
Equal Variance
(F-Test)?
Yes
Homoscedastic
T-Test
Yes
Sufficient n for
CLT (>30)?
No
Heteroscedastic
T-Test
Lecture 4: Multivariate Analyses I: Specific Cases
No
Wilcoxon
U-Test
bioinformatics.ca
Review From Lecture #2
Parametric vs. non-parametric tests
Parametric tests have distributional assumptions
What is the t-statistics?
Signal:Noise ratio
Assumptions of the t-test?
Data sampled from normal distribution;
independence of replicates; independence of
groups; homoscedasticity
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Review From Lecture #3
What is statistical power?
Probability a test will incorrect reject the null
AKA sensitivity or 1- false-negative rate
What is a correlation?
A relationship between two (random) variables
Common correlation metrics?
Pearson, Spearman, Kendall
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Review From Lecture #3
What is a ceRNA?
An endogenous RNA that “soaks up” miRNAs to
prevent their activity on another endogenous
RNA
What are the major univariate discrete tests?
Pearson’s Chi-Squared, Fisher’s Exact,
Proportion, Hypergeometric
Common correlation metrics?
Pearson, Spearman, Kendall
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Four Main Discrete Univariate Tests
• Hypergeometric test
• Is a sample randomly selected from a fixed population?
• Proportion test
• Are two proportions equivalent?
• Fisher’s Exact test
• Are two binary classifications associated?
• (Pearson’s) Chi-Squared Test
• Are paired observations on two variables independent?
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
When Do We Use Statistics?
• Ubiquitous in modern biology
• Every class I will show a use of statistics in a (very, very)
recent Nature paper.
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Cervix Cancer 101
• Diesease burden increasing
• (~380k to ~450k in the last 30 years)
• By age 50, >80% of women have HPV infection
• >75% of sexually active women exposed, only a subset affected
• Why is nearly totally unknown!
• Tightly Associated with Poverty
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
HPV Infection Associated Cancers
•
•
•
•
•
•
Cervix
Anal
Vaginal
Vulvar
Penile
Head & Neck
>99%
~85%
~70%
~40%
~45%
~20-30%
Of course not all of these are the HPV subtypes
caught by current vaccines, but a majority are.
Thus many cancers are preventable.
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Figure 1 is a Classic Sequencing Figure
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
What Statistical Analysis Did They Do?
• Lots of information in Supplementary, but in large part
citations to previous work
• Main text, mutation rate vs. histology compared using
Wilcoxon
• Reported p-value, sample-size, effect-size
• They did incredibly good reporting in supplementary. For
example...
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Attendance Break
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Problem
When we measure more one
than one variable for each
member of a population, a
scatter plot may show us that
the values are not completely
independent: there is e.g. a
trend for one variable to
increase
as
the
other
increases.
Regression analyzes the
dependence.
Examples:
• Height vs. weight
• Gene dosage vs.
expression level
• Survival analysis:
probability of death vs. age
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Correlation
When one variable depends on
the other, the variables are to
some degree correlated.
(Note: correlation need not
imply causality.)
In R, the function cov()
measures covariance and cor()
measures the Pearson
coefficient of correlation (a
normalized measure of
covariance).
Pearson's coeffecient of
correlation values range
from -1 to 1, with 0 indicating
no correlation.
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Modeling
Specify Model
Estimate Parameters
No
Adequate?
Yes
Use Model
Lecture 4: Multivariate Analyses I: Specific Cases
Linear regression is one
possible model we can apply to
data analysis.
A model in the statistician's
sense might not be what you
think ... it is merely a device to
explain data. While it may help
you consider mechanisms and
causalities, it is not necessarily a
representation of any particular
physical or biological
mechanism.
Remember: correlation does not
entail causation.
bioinformatics.ca
Types of regression
Linear regression assumes a particular model:
yi = a + bxi + ei
xi is the independent variable. Depending on the context, also known as
a "predictor variable," "regressor," "controlled variable," "manipulated variable,"
"explanatory variable," "exposure variable," and/or "input variable."
yi is the dependent variable, also known as "response variable,"
"regressand," "measured variable," "observed variable," "responding variable,"
"explained variable," "outcome variable," "experimental variable," and/or "output
variable."
i are "errors" - not in the sense of being "wrong", but in the sense of creating
deviations from the idealized model. The i are assumed to be independent and N(0,2)
(normally distributed), they can also be called residuals.
This model has two parameters: the regression coefficient
, and the intercept .
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression
• Assumptions:
• Only two variables are of interest
• One variable is a response and one a predictor
• No adjustment is needed for confounding or other
between-subject variation
• Linearity
• σ2 is constant, independent of x
 i are independent of each other
• For proper statistical inference (CI, p-values), i are normal
distributed
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression
Linear regression analysis includes:
•
estimation of the parameters;
•
characterization how good the model is.
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression: estimation
Parameter estimation: choose parameters that come as
close as possible to the "true" values.
Problem: how do we distinguish "good" from "poor"
estimates?
One possibility: minimize the Sum of Squared Errors
SSE
In a general sense, for a sample S = {( x1 ,y1 ),( x2 ,y2 ),… , ( xn ,yn )}
and a model M,
n
SSE = å ( yi - M ( xi ))
2
i=1
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression: estimation
For a linear model, with the estimated parameters a, b
n
SSE = å ( yi - a - b( xi ))
2
i=1
Estimation: choose parameters a, b so that the SSE is
as small as possible. We call these: least squares
estimates.
This method of least squares has an analytic solution
for the linear case.
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
synthHWsamples <- function(n) {
set.seed(83)
# parameters for a height vs. weight plot
hmin <- 1.5
hmax <- 2.3
M <- matrix(nrow=n,ncol=2)
# generate a column of numbers in the interval
M[,1] <- runif(n, hmin, hmax)
# generate a column of numbers with a linear model
M[,2] <- 40 * M[,1] + 1
# add some errors
M[,2] <- M[,2] + rnorm(n, 0, 15)
return(M)
}
Under the parameters used above, a linear regression analysis should show a slope
of 40 kg/m and an intercept of 40 kg.
It is always a good idea to sanity-test your analysis with synthetic data. After all: if
you can't retrieve your model parameters from synthetic data, how could you trust
your analysis of real data?
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
> HW<synthHWsamples(50)
> plot(HW, xlab="Height
(cm)", ylab="Weight (kg)")
> cov(HW[,1], HW[,2])
[1] 2.498929
> cor(HW[,1], HW[,2])
[1] 0.5408063
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Pearson's Coefficient of Correlation
How to interpret the correlation
coefficient:
Explore varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.99;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.9999666
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.8;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.9661111
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
Estimate a linear model
to recover parameters:
> lm(HW[,2] ~ HW[,1])
Call:
lm(formula = HW[, 2] ~ HW[, 1])
Coefficients:
(Intercept)
-2.86
dat[, 1]
42.09
> abline(-2.86, 42.09)
or:
abline(lm(HW[,2] ~ HW[,1]), col=rgb(192/255, 80/255, 77/255), lwd=3)
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
Extract information:
> summary(lm(HW[,2] ~ HW[,1]))
Call:
lm(formula = HW[, 2] ~ HW[, 1])
Residuals:
Min
1Q Median
3Q
Max
-36.490 -10.297 3.426 9.156 37.385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.860 18.304 -0.156 0.876
HW[, 1]
42.090
9.449 4.454 5.02e-05 ***
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16.12 on 48 degrees of freedom
Multiple R-squared: 0.2925, Adjusted R-squared: 0.2777
F-statistic: 19.84 on 1 and 38 DF, p-value: 5.022e-05
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
> summary(lm(HW[,2] ~ HW[,1]))
Call:
lm(formula = HW[, 2] ~ HW[, 1])
Residuals:
Min
1Q Median
3Q
Max
-36.490 -10.297 3.426 9.156 37.385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.860 18.304 -0.156 0.876
HW[, 1]
42.090
9.449 4.454 5.02e-05 ***
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16.12 on 48 degrees of freedom
Multiple R-squared: 0.2925, Adjusted R-squared: 0.2777
F-statistic: 19.84 on 1 and 38 DF, p-value: 5.022e-05
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
Extract information:
> summary(lm(HW[,2] ~ HW[,1]))
Call:
lm(formula = HW[, 2] ~ HW[, 1])
Residuals:
Min
1Q Median
3Q
Max
-36.490 -10.297 3.426 9.156 37.385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.860 18.304 -0.156 0.876
HW[, 1]
42.090
9.449 4.454 5.02e-05 ***
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16.12 on 48 degrees of freedom
Multiple R-squared: 0.2925, Adjusted R-squared: 0.2777
F-statistic: 19.84 on 1 and 38 DF, p-value: 5.022e-05
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
Extract information:
> summary(lm(HW[,2] ~ HW[,1]))
Call:
lm(formula = HW[, 2] ~ HW[, 1])
Residuals:
Min
1Q Median
3Q
Max
-36.490 -10.297 3.426 9.156 37.385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.860 18.304 -0.156 0.876
HW[, 1]
42.090
9.449 4.454 5.02e-05 ***
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16.12 on 48 degrees of freedom
Multiple R-squared: 0.2925, Adjusted R-squared: 0.2777
F-statistic: 19.84 on 1 and 38 DF, p-value: 5.022e-05
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
Extract information:
> summary(lm(HW[,2] ~ HW[,1]))
Call:
lm(formula = HW[, 2] ~ HW[, 1])
Residuals:
Min
1Q Median
3Q
Max
-36.490 -10.297 3.426 9.156 37.385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.860 18.304 -0.156 0.876
HW[, 1]
42.090
9.449 4.454 5.02e-05 ***
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16.12 on 48 degrees of freedom
Multiple R-squared: 0.2925, Adjusted R-squared: 0.2777
F-statistic: 19.84 on 1 and 38 DF, p-value: 5.022e-05
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression: quality control
Intepreting the results has two parts:
1: Is the model adequate? (Residuals)
2: Are the parameter estimates good? (Confidence limits)
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression: residuals
Residuals:
The solid red line is the
least-squares-fit line
(regression line), defined
by a particular slope and
intercept. The red lines
between the regression
line and the actual data
points are the residuals.
Residuals are "signed" i.e.
negative if an observation
is smaller than the
corresponding value of the
regression line.
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression: quality control
Residual plots allow us to validate underlying
assumptions:
–Relationship between response and regressor
should be linear (at least approximately).
–Error term,  should have zero mean
–Error term,  should have constant variance
–Errors should be normally distributed (required
for tests and intervals)
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression: quality control
Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis
Check
constant
variance and
linearity, and
look for
potential
outliers.
What does
our synthetic
data look like,
regarding
this aspect?
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
Get residuals:
res <- resid(lm(HW[,2] ~ HW[,1]))
Get idealized values:
fit <- fitted(lm(HW[,2] ~ HW[,1]))
Plot differences:
segments(HW[,1], HW[,2], HW[,1], fit, col=2)
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
fit vs. residuals
> plot(fit, res)
> cor(fit, res)
[1] -1.09228e-16
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression: Q-Q plot
Adequate
Inadequate Residuals vs.
similarly
distributed
normal deviates
check the
normality
assumption
Inadequate
Inadequate
Inadequate
Source: Montgomery et al.,
2001, Introduction to Linear
Regression Analysis
Cumulative probability
of normal distribution
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
Q-Q plot: are the
residuals normally
distributed?
qqnorm(res)
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression: Evaluating accuracy
If the model is valid, i.e. nothing terrible in the
residuals, we can use it to predict. But how good is
the prediction?
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
prediction and confidence limits
> pp<-predict(lm(HW[,2] ~ HW[,1]), int="p")
Warning message:
In predict.lm(lm(HW[, 2] ~ HW[, 1]), int = "p") :
Predictions on current data refer to _future_ responses
> pc<-predict(lm(HW[,2] ~ HW[,1]), int="c")
> head(pc)
fit
lwr
upr
1 60.57098 51.45048 69.69148
2 67.98277 61.53194 74.43360
3 77.96070 73.37784 82.54356
4 92.04435 84.23698 99.85171
5 76.34929 71.70340 80.99518
6 76.57656 71.94643 81.20670
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
Plot pp and pc limits
a: sort on x
o <- order(HW[,1])
HW2 <- HW[o,]
b: recompute pp, pc
pc<-predict(lm(HW2[,2] ~ HW2[,1]), int="c")
pp<-predict(lm(HW2[,2] ~ HW2[,1]), int="p")
c: plot
> plot(HW2, xlab="Height (cm)", ylab="Weight (kg)", ylim=range(HW2[,2], pp))
> matlines (HW2[,1], pc, lty=c(1,2,2), col="black")
> matlines (HW2[,1], pp, lty=c(1,3,3), col="red")
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Linear regression example: height vs. weight
Plot pp and pc limits
interval
a: sort onprediction
x
(at p=0.95)
o <- order(HW[,1])
(values)
HW2 <- HW[o,]
b: recompute pp, pc
pc<-predict(lm(HW2[,2] ~ HW2[,1]), int="c")
pp<-predict(lm(HW2[,2] ~ HW2[,1]), int="p")
confidence interval
(at p=0.95)
(parameters)
c: plot
> plot(HW2, xlab="Height (cm)", ylab="Weight (kg)", ylim=range(HW2[,2], pp))
> matlines (HW2[,1], pc, lty=c(1,2,2), col="black")
> matlines (HW2[,1], pp, lty=c(1,3,3), col="red")
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Lots of Analyses Are Linear Regressions
Y = a0 + a1x1
x1 continuous
Linear Regression
Y = a0 + a1x1
Y factorial
Logistic Regression
Y = a0 + a1x1
x1 factorial
1-way ANOVA
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
One-Way ANOVAs in R
# have a list of groups
x <- as.factor(rep(c(‘A’,‘B’,C’), 3));
# and some continuous data
y <- rnorm(9);
# fit a one-way anova with:
tmp <- aov(y ~ x);
# get a p-value with:
summary(tmp);
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Course Overview
•
•
•
•
•
•
•
•
•
•
Lecture 1: What is Statistics? Introduction to R
Lecture 2: Univariate Analyses I: continuous
Lecture 3: Univariate Analyses II: discrete
Lecture 4: Multivariate Analyses I: specialized models
Lecture 5: Multivariate Analyses II: general models
Lecture 6: Microarray Analysis I: Pre-Processing
Lecture 7: Microarray Analysis II: Multiple-Testing
Lecture 8: Data Visualization & Machine-Learning
Lecture 9: Sequence Analysis Basics
Final Exam (written)
Lecture 4: Multivariate Analyses I: Specific Cases
bioinformatics.ca
Download