BIOINF 2118
ANOVA, logistic regression, discriminant analysis, survival analysis p. 1 of 4
ANOVA: Analysis of variance
In “supervised learning” (regression analysis), when a predictor is a CATEGORY, we can replace it
by a series of indicator functions, then do linear regression as before.
See example: hotdogs.R
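As a rough illustration of the indicator-function idea (not the contents of hotdogs.R), the sketch below codes a three-level category as indicator columns and fits least squares without an intercept. The data and the level names "A", "B", "C" are made up for the example.

```python
# Hypothetical data: a response y and a categorical predictor with three levels.
categories = ["A", "B", "C", "A", "C", "B"]
y = [150.0, 160.0, 120.0, 170.0, 100.0, 140.0]

levels = sorted(set(categories))  # ['A', 'B', 'C']

# One indicator column per level: X[i][k] = 1 if observation i is in level k.
X = [[1.0 if c == lev else 0.0 for lev in levels] for c in categories]

# Least squares without an intercept solves (X'X) b = X'y.
# With pure indicator columns, X'X is diagonal (the counts n_k) and X'y holds
# the per-level sums, so each coefficient is simply that level's mean.
counts = [sum(row[k] for row in X) for k in range(len(levels))]
sums = [sum(row[k] * yi for row, yi in zip(X, y)) for k in range(len(levels))]
coef = {lev: s / n for lev, s, n in zip(levels, sums, counts)}

print(coef)
```

With an intercept in the model, one indicator column would be dropped and the remaining coefficients would become differences from the reference level.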
The “ANOVA” point of view: partitioning the sum of squares of deviations.
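For a “one-way layout”, suppose there are $n_i$ observations in category $i$, $i = 1, \ldots, p$, with $n = \sum_i n_i$ observations in all.
Define
$\bar{Y}_{i\cdot} = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}$ (the mean in category $i$), $\bar{Y}_{\cdot\cdot} = \frac{1}{n} \sum_{i=1}^{p} \sum_{j=1}^{n_i} Y_{ij}$ (the grand mean),
$S^2_{\text{within}} = \sum_{i} \sum_{j} (Y_{ij} - \bar{Y}_{i\cdot})^2$ and $S^2_{\text{between}} = \sum_{i} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$.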
We compare the model in which all the individual means are the same,
H0: all $E(Y_{ij}) = \mu$,
to the “omnibus” (all-direction) alternative
HA: not all $E(Y_{ij})$ are the same.
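Under either hypothesis, $S^2_{\text{within}} / \sigma_y^2 \sim \chi^2_{n-p}$. Replacing the right-hand side by its mean, $n - p$, we get the estimate
$\hat{\sigma}_y^2(\text{within}) = S^2_{\text{within}} / (n - p)$.
Under the null hypothesis, $S^2_{\text{between}} / \sigma_y^2 \sim \chi^2_{p-1}$, and the same argument gives
$\hat{\sigma}_y^2(\text{between}) = S^2_{\text{between}} / (p - 1)$. The two estimates should be similar.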
But under the alternative, since the true means will differ from each other, $S^2_{\text{between}}$ will be bigger.
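So a good test statistic, whose distribution is exactly known when data are normal, is
$F = \dfrac{S^2_{\text{between}} / (p - 1)}{S^2_{\text{within}} / (n - p)}$.
This is a ratio of independent chi-square statistics, each divided by its degrees of freedom, so it’s an $F_{p-1,\,n-p}$ random variate under H0.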
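The following is a minimal sketch of the sum-of-squares partition and the resulting F ratio, assuming the usual one-way definitions of the within and between sums of squares. The function name `one_way_anova_f` and the toy data are illustrative only; a p-value would additionally need the F distribution, which is not computed here.

```python
def one_way_anova_f(groups):
    """Return (S_within, S_between, F) for a list of groups of observations."""
    n = sum(len(g) for g in groups)          # total number of observations
    p = len(groups)                          # number of categories
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Deviations of each observation from its own category mean.
    s_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    # Deviations of category means from the grand mean, weighted by n_i.
    s_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))

    f_stat = (s_between / (p - 1)) / (s_within / (n - p))
    return s_within, s_between, f_stat


# Toy data: three categories with three observations each.
s_within, s_between, f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(s_within, s_between, f_stat)
```

For these toy data the within and between sums of squares are 6 and 26, so the ratio is (26/2)/(6/6) = 13, to be compared with an F distribution on 2 and 6 degrees of freedom.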
Logistic regression
When the target (dependent variable) is BINARY, the regression methods we’ve discussed don’t work
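well. Instead we use LOGISTIC models:
$\text{logit}(P(Y_i = 1 \mid X_i)) = X_i \beta$,
where “logit” means log odds: $\text{logit}(p) = \log\frac{p}{1 - p}$. Its inverse is $\text{antilogit}(z) = \frac{e^z}{1 + e^z}$.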
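A small sketch of the log-odds transform and its inverse, to check numerically that one undoes the other. The function names simply mirror the terms used above.

```python
import math


def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def antilogit(z):
    """Inverse of logit; algebraically equal to e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))


p = 0.3
z = logit(p)               # negative, since p < 1/2
round_trip = antilogit(z)  # should recover p up to rounding
print(z, round_trip)
```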
The likelihood function is $L(\beta) = \prod_i p_i^{Y_i} (1 - p_i)^{1 - Y_i}$, where $p_i = \text{antilogit}(X_i \beta)$.
Maximizing the likelihood usually requires an iterative algorithm.
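One common iterative algorithm is Newton's method applied to the Bernoulli log-likelihood. The sketch below assumes a single predictor plus an intercept; the function `fit_logistic`, the fixed iteration count, and the data are all illustrative choices rather than the method used in the course scripts.

```python
import math


def antilogit(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, n_iter=25):
    """Newton's method for logit(P(Y=1|x)) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [antilogit(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Negative Hessian (information matrix), with weights p(1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        # Solve the 2x2 system by Cramer's rule and take the Newton step.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical, non-separable data.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0, 0, 1, 0, 1, 1, 0, 1]
b0, b1 = fit_logistic(x, y)

fitted = [antilogit(b0 + b1 * xi) for xi in x]
p_new = antilogit(b0 + b1 * 4.5)  # predicted probability for a new x
print(b0, b1, p_new)
```

At the maximum the gradient is zero, so the fitted probabilities sum to the number of observed successes; that gives a quick convergence check.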
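Once we have a model fit, say $\hat{\beta}$, we can predict for future observations:
$\hat{P}(Y_{\text{new}} = 1 \mid X_{\text{new}}) = \text{antilogit}(X_{\text{new}} \hat{\beta})$.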
Logistic regression is a special case of GENERALIZED LINEAR MODELS or GLMs:
$g(E(Y \mid X)) = X \beta$,
where g is called a LINK FUNCTION and the distribution of Y is known given its expectation:
$Y \mid E(Y) \sim F_{E(Y)}$.
For logistic regression, the link function is logit and the distribution is Bernoulli(E(Y)).
See prisonersPicnic.R for examples.
An important observation about logistic regression and discriminant analysis
The data are: $\begin{pmatrix} Y_1 \\ X_1 \end{pmatrix}, \ldots, \begin{pmatrix} Y_n \\ X_n \end{pmatrix}$, where the Y are group membership indicators.
In logistic regression, the Y are considered the “targets” or “dependent variables”.
We condition on the X’s, the “predictors” or “independent variables”.
But some techniques (like discriminant analysis) involve modeling the joint distribution of the X’s with
the Y’s. This has the effect of stabilizing estimates. In contrast, logistic regression models only the
conditional distribution Y|X. See Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.
Compare logistic regression (LR)
$\hat{\beta}_{LR} = \arg\max_\beta \sum_i \log[Y_i \mid X_i]$
with linear discriminant analysis (LDA):
$\hat{\beta}_{LDA} = \arg\max_\beta \sum_i \log[Y_i, X_i] = \arg\max_\beta \left( \sum_i \log[Y_i \mid X_i] + \sum_i \log[X_i] \right)$.
So LDA’s modeling of [X] is like a regularizer, as if there is a penalty function $(-1) \sum_i \log[X_i]$.
For example, if $\{X_i : Y_i = 0\}$ and $\{X_i : Y_i = 1\}$ are perfectly separated by a hyperplane, then the (LR) MLEs will go to infinity (overfitting), while the (LDA) MLEs will not.
LR conditions on X. LDA models [X]. Since two Gaussian point clouds always extend to infinity and
therefore interpenetrate, perfect (over-)fitting can never be achieved by LDA.
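The separation effect can be seen numerically. The sketch below uses made-up, perfectly separated one-dimensional data and a no-intercept logistic model (a simplifying assumption): the LR log-likelihood keeps increasing as the slope grows, so no finite maximizer exists. For comparison it computes the slope implied by a two-class Gaussian model with a common variance, (mean1 - mean0) / variance, which stays finite.

```python
import math


def log_antilogit(z):
    """Numerically stable log(e^z / (1 + e^z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


# Hypothetical data, perfectly separated at x = 0.
x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]


def lr_loglik(b):
    """Log-likelihood of logit(P(Y=1|x)) = b * x (no intercept)."""
    # log P(Y=1) = log antilogit(z); log P(Y=0) = log antilogit(-z).
    return sum(
        log_antilogit(b * xi) if yi == 1 else log_antilogit(-b * xi)
        for xi, yi in zip(x, y)
    )


# The LR log-likelihood climbs toward 0 as the slope increases without bound.
logliks = [lr_loglik(b) for b in (1.0, 10.0, 100.0)]

# LDA-style slope: class means and pooled variance (maximum likelihood form).
x0 = [xi for xi, yi in zip(x, y) if yi == 0]
x1 = [xi for xi, yi in zip(x, y) if yi == 1]
mean0 = sum(x0) / len(x0)
mean1 = sum(x1) / len(x1)
pooled_var = (
    sum((xi - mean0) ** 2 for xi in x0) + sum((xi - mean1) ** 2 for xi in x1)
) / len(x)
lda_slope = (mean1 - mean0) / pooled_var

print(logliks, lda_slope)
```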
Proportional hazards regression
Time-to-event data comes in two types:
“Complete”: the event has happened.
“Censored”: the event has not yet happened.
You could just regard the outcome as binary, but that discards all the time information. Instead we form a likelihood function that uses all the data correctly.
These require special “survival analysis” methods.
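To make the idea of a likelihood that uses both kinds of observation concrete, here is a sketch under a strong simplifying assumption that the notes do not make: event times are exponential with a constant rate `lam`. A complete observation contributes the density at its event time, and a censored observation contributes the probability of surviving past its last follow-up time. The data are made up.

```python
import math

# Hypothetical follow-up times; events[i] is 1 if complete, 0 if censored.
times = [2.0, 3.0, 5.0, 4.0, 6.0]
events = [1, 1, 0, 0, 0]


def exp_loglik(lam):
    """Log-likelihood under an assumed exponential model with rate lam."""
    ll = 0.0
    for t, e in zip(times, events):
        if e == 1:
            ll += math.log(lam) - lam * t  # log density: event observed at t
        else:
            ll += -lam * t                 # log survival: no event up to t
    return ll


# For this model the maximizer has a closed form:
# number of events divided by total follow-up time.
lam_hat = sum(events) / sum(times)
print(lam_hat, exp_loglik(lam_hat))
```

The censored observations never add an event to the numerator, but their follow-up time still counts in the denominator, which is how the time information is retained.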

Cox proportional hazards regression:
Define the hazard function to be $h(t) = \frac{f(t)}{1 - F(t)}$, where $f$ and $F$ are the density and c.d.f. of the event time, interpreted as the probability density
of “failing” (event happening) at t, given that the event hasn’t happened yet.
The proportional hazards assumption is:
$h(t \mid X_i) = h_0(t) \exp(X_i \beta)$,
where there is an unknown “baseline hazard function” $h_0$.
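A consequence of the assumption is that the ratio of hazards for two subjects does not depend on t, because the baseline cancels. The sketch below checks this numerically. The baseline `0.1 * t`, the coefficient 0.7, and the covariate values are arbitrary choices for illustration; the Cox model itself leaves the baseline unspecified.

```python
import math


def baseline_hazard(t):
    # Hypothetical baseline hazard h0(t), chosen only for this illustration.
    return 0.1 * t


def hazard(t, x, beta):
    """Proportional hazards form: h(t | x) = h0(t) * exp(x * beta)."""
    return baseline_hazard(t) * math.exp(x * beta)


beta = 0.7
x_i, x_j = 1.0, 0.0

# Hazard ratio between subjects i and j at several time points.
ratios = [hazard(t, x_i, beta) / hazard(t, x_j, beta) for t in (0.5, 1.0, 5.0)]
expected = math.exp((x_i - x_j) * beta)
print(ratios, expected)
```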
Other methods common in survival analysis are:
The Kaplan-Meier estimator, which estimates the c.d.f. of the event time. (Actually it estimates the “survival function”, defined as 1 – c.d.f.)
The log-rank test, for testing whether two survival functions are the same or different. This is useful, for example, in a randomized clinical trial, when testing if two medicines differ in the time to death, time to relapse, or time to some other event.
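A minimal sketch of the Kaplan-Meier product-limit calculation follows: at each distinct event time, the running survival estimate is multiplied by one minus the fraction of at-risk subjects who had the event. The function name `kaplan_meier` and the data are illustrative.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for ti in times if ti >= t)
        # Events occurring exactly at time t.
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


# Hypothetical data: events at t = 1, 3, 4; censoring at t = 2, 5.
km = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
print(km)
```

The subject censored at t = 2 counts in the risk set at t = 1 but not afterward, which is how censored follow-up still informs the estimate.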