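BIOINF 2118 ANOVA, logistic regression, discriminant analysis, survival analysis

ANOVA: Analysis of variance

In “supervised learning” (regression analysis), when a predictor is a CATEGORY, we can replace it by a series of indicator functions, then do linear regression as before. See example: hotdogs.R

The “ANOVA” point of view: partitioning the sum of squares of deviations.

For a “one-way layout”, suppose there are $n_i$ observations $Y_{ij}$, $j = 1, \dots, n_i$, in category $i$, $i = 1, \dots, p$, and let $n = \sum_i n_i$. Define

$$\bar{Y}_{i\cdot} = \frac{1}{n_i} \sum_{j} Y_{ij}, \qquad \bar{Y}_{\cdot\cdot} = \frac{1}{n} \sum_{i} \sum_{j} Y_{ij},$$

$$S^2_{\text{within}} = \sum_{i} \sum_{j} (Y_{ij} - \bar{Y}_{i\cdot})^2, \qquad S^2_{\text{between}} = \sum_{i} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2,$$

so that the total sum of squares splits as $\sum_i \sum_j (Y_{ij} - \bar{Y}_{\cdot\cdot})^2 = S^2_{\text{within}} + S^2_{\text{between}}$.

We compare the model in which all the individual means are the same:

H0: all $E(Y_{ij}) = \mu$

to the “omnibus” (all-direction) alternative

HA: not all $E(Y_{ij})$ are the same.

Let $\sigma^2$ be the common variance of the $Y_{ij}$. Under either hypothesis,

$$S^2_{\text{within}} / \sigma^2 \sim \chi^2_{n-p}.$$

Replacing the right-hand side by its mean, $n - p$, we get the estimate

$$\hat{\sigma}^2_y(\text{within}) = S^2_{\text{within}} / (n - p).$$

Under the null hypothesis,

$$S^2_{\text{between}} / \sigma^2 \sim \chi^2_{p-1}$$

and, by the same argument,

$$\hat{\sigma}^2_y(\text{between}) = S^2_{\text{between}} / (p - 1).$$

Under H0 the two estimates should be similar. But under the alternative, since the true means will differ from each other, $S^2_{\text{between}}$ will be bigger. So a good test statistic, whose distribution is exactly known when the data are normal, is

$$F = \frac{S^2_{\text{between}} / (p - 1)}{S^2_{\text{within}} / (n - p)}.$$

This is a ratio of independent chi-square statistics, each divided by its degrees of freedom, so it is an $F_{p-1,\,n-p}$ random variate under H0.

Logistic regression

When the target (dependent variable) is BINARY, the regression methods we’ve discussed don’t work well. Instead we use LOGISTIC models:

$$\text{logit}\big(P(Y_i = 1 \mid X_i)\big) = X_i \beta,$$

where “logit” means log odds:

$$\text{logit}(p) = \log \frac{p}{1 - p}.$$

Its inverse is

$$\text{antilogit}(z) = \frac{e^z}{1 + e^z}.$$

Writing $p_i = \text{antilogit}(X_i \beta)$, the likelihood function is

$$L(\beta) = \prod_i p_i^{Y_i} (1 - p_i)^{1 - Y_i}.$$

Maximizing the likelihood usually requires an iterative algorithm.

Once we have a model fit, say $\hat{\beta}$, we can predict for future observations:

$$\hat{P}(Y_{\text{new}} = 1 \mid X_{\text{new}}) = \text{antilogit}(X_{\text{new}} \hat{\beta}).$$

Logistic regression is a special case of GENERALIZED LINEAR MODELS or GLMs:

$$g\big(E(Y \mid X)\big) = X \beta,$$

where $g$ is called a LINK FUNCTION and the distribution of $Y$ is known given its expectation: $Y \mid E(Y) \sim F_{E(Y)}$.

For logistic regression, the link function is logit and the distribution is Bernoulli($E(Y)$).

See prisonersPicnic.R for examples.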
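As an illustration of the “iterative algorithm” mentioned for logistic regression, here is a minimal sketch of Newton-Raphson maximization of the Bernoulli log-likelihood for a model with an intercept and a single predictor. It is not taken from prisonersPicnic.R; the function names `antilogit` and `fit_logistic` and the small data set are hypothetical, and only the Python standard library is used.

```python
import math

def antilogit(z):
    """Inverse logit: convert log odds z to the probability e^z / (1 + e^z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # this branch avoids overflow for very negative z
    return ez / (1.0 + ez)

def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson fit of logit(P(Y=1|x)) = b0 + b1*x. Returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [antilogit(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve (information matrix) * step = gradient.
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0 += s0
        b1 += s1
        if abs(s0) + abs(s1) < tol:
            break
    return b0, b1

# Hypothetical data: the two outcome groups overlap in x, so the MLE is finite.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(x, y)
print("intercept, slope:", b0, b1)
print("predicted P(Y = 1 | x = 2.25):", antilogit(b0 + b1 * 2.25))
```

At convergence the gradient is zero, which for this model means the fitted probabilities sum to the observed number of ones. If the two groups were perfectly separated in x, the steps would not settle down, because the likelihood keeps increasing as the slope grows.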
An important observation about logistic regression and discriminant analysis

The data are the pairs

$$\begin{pmatrix} Y_1 \\ X_1 \end{pmatrix}, \dots, \begin{pmatrix} Y_n \\ X_n \end{pmatrix},$$

where the $Y_i$ are group membership indicators.

In logistic regression, the $Y_i$ are considered the “targets” or “dependent variables”. We condition on the $X_i$, the “predictors” or “independent variables”.

But some techniques (like discriminant analysis) involve modeling the joint distribution of the X’s with the Y’s. This has the effect of stabilizing estimates. In contrast, logistic regression models only the conditional distribution $Y \mid X$. See Hastie, Tibshirani, Friedman, Elements of Statistical Learning.

Writing $[\,\cdot\,]$ for a probability density or mass function, compare logistic regression (LR)

$$\hat{\beta}_{LR} = \arg\max_\beta \sum_i \log [Y_i \mid X_i]$$

with linear discriminant analysis (LDA):

$$\hat{\beta}_{LDA} = \arg\max_\beta \sum_i \log [Y_i, X_i] = \arg\max_\beta \left( \sum_i \log [Y_i \mid X_i] + \sum_i \log [X_i] \right).$$

So LDA’s modeling of $[X]$ is like a regularizer, as if there is a penalty function $\sum_i \log [X_i]$.

For example, if $\{X_i : Y_i = 0\}$ and $\{X_i : Y_i = 1\}$ are perfectly separated by a hyperplane, then the LR MLEs will go to infinity (overfitting), while the LDA MLEs will not.

LR conditions on X. LDA models [X]. Since two Gaussian point clouds always extend to infinity and therefore interpenetrate, perfect (over-)fitting can never be achieved by LDA.

Proportional hazards regression

Time-to-event data come in two types: observations for which the event has happened (“complete”), and observations for which the event has not yet happened (“censored”).

[Figure: observations labeled “complete” or “censored”.]

You could just regard the outcome as binary, but that discards all the time information. Instead we form a likelihood function that uses all the data correctly. Such data require special “survival analysis” methods.
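To make concrete how a likelihood can use both complete and censored observations, here is a minimal sketch under an assumption the notes do not make: that event times are exponential with a constant hazard `lam`. A complete observation at time t then contributes the density lam * exp(-lam * t), and a censored observation contributes only the probability exp(-lam * t) of surviving past t. The data and the function names are hypothetical.

```python
import math

# Hypothetical follow-up data: (time, event), with event = 1 for a "complete"
# observation and event = 0 for a "censored" one.
data = [(2.0, 1), (5.0, 1), (3.0, 0), (7.0, 0), (4.0, 0)]

def exp_loglik(lam, data):
    """Log-likelihood under a constant hazard lam (exponential model, an assumption).

    Complete observations contribute log(lam) - lam * t;
    censored observations contribute only -lam * t.
    """
    return sum(event * math.log(lam) - lam * t for t, event in data)

def exp_mle(data):
    """Maximizer of exp_loglik: number of events divided by total follow-up time."""
    n_events = sum(event for _, event in data)
    total_time = sum(t for t, _ in data)
    return n_events / total_time

lam_hat = exp_mle(data)

# For contrast: discarding the censored subjects overstates the hazard,
# because their event-free follow-up time is thrown away.
complete_only = [(t, e) for t, e in data if e == 1]
lam_naive = exp_mle(complete_only)

print("hazard estimate using all data:", lam_hat)
print("hazard estimate from complete cases only:", lam_naive)
```

The censored subjects add follow-up time to the denominator without adding events to the numerator, which is how their partial information enters the estimate.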
Cox proportional hazards regression:

Let $f$ be the density and $F$ the c.d.f. of the event time. Define the hazard function to be

$$h(t) = \frac{f(t)}{1 - F(t)},$$

interpreted as the probability density of “failing” (event happening) at $t$, given that the event hasn’t happened yet.

The proportional hazards assumption is:

$$h(t \mid X_i) = h_0(t) \exp(X_i \beta),$$

where there is an unknown “baseline hazard function” $h_0$.

Other methods common in survival analysis are:

The Kaplan-Meier estimator, which estimates the c.d.f. of the event time. (Actually it estimates the “survival function”, defined as $S(t) = 1 - F(t)$.)

The log-rank test, for testing whether two survival functions are the same or different. This is useful, for example, in a randomized clinical trial, when testing whether two medicines differ in the time to death, time to relapse, or time to some other event.
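The Kaplan-Meier estimator can be sketched in a few lines. At each distinct time where an event is observed, the current survival estimate is multiplied by (1 - d / r), where d is the number of events at that time and r is the number of subjects still at risk just before it. The function name `kaplan_meier` and the data below are hypothetical; this is an illustration, not a substitute for a survival analysis package.

```python
def kaplan_meier(times, events):
    """Return [(t, S_hat(t))] at each distinct time where an event was observed.

    times:  follow-up time for each subject
    events: 1 if the event was observed ("complete"), 0 if censored
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Events observed exactly at time t.
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1.0 - d / at_risk
        curve.append((t, surv))
    return curve

# Hypothetical data: five subjects, two observed events and three censored.
times = [2.0, 3.0, 4.0, 5.0, 7.0]
events = [1, 0, 0, 1, 0]

km = kaplan_meier(times, events)
for t, s in km:
    print("t =", t, " estimated S(t) =", s)
```

Censored subjects never trigger a drop in the curve, but they do count in the risk set up to their censoring time. Here the second event occurs with only two subjects at risk, so the estimate halves at that point.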