Multicategory Logit Models

In the past, we have restricted the response or dependent variable in logit models to be dichotomous. Now we will consider a response variable, Y, with J levels. The explanatory or independent variables may be quantitative, qualitative, or both. There are three ways in which logistic regression models for response variables with more than two outcomes differ from logistic regression for dichotomous data.

1. How the logits are formed. When J = 2 there is only one logit we can form; when J > 2 there are J(J-1)/2 pairwise logits we could form, but only J - 1 of them are non-redundant. There are different ways to choose the non-redundant logits, each of which amounts to a different way of "dichotomizing" the response variable. The way we choose to form the logits will depend in part on whether Y is ordinal or nominal.

2. The sampling distribution. When Y is dichotomous, at each combination of the explanatory variables we assume that the data come from a binomial distribution. When J > 2, at each combination of the explanatory variables we assume that the data come from a multinomial distribution. The binomial distribution is a special case of the multinomial distribution. The multinomial distribution depends on n and the category probabilities \pi_1, \ldots, \pi_J, and it gives the probability of each way of classifying the n observations into the J categories of the response variable. For example, the possible ways to classify n = 2 observations into J = 3 categories are:

   y1  y2  y3
    2   0   0
    0   2   0
    0   0   2
    1   1   0
    0   1   1
    1   0   1

3. Connections with other models, such as loglinear models. Some multicategory logit models are equivalent to Poisson regression or loglinear models, while others are derived from latent variable models. For example, some are very similar to IRT models in their parametric form; however, in logistic regression we assume the predictor variable is observed, while in IRT the predictor is unobserved or latent.

Baseline Category Logit Model for Nominal Response Variables

This model is basically just an extension of the binary logistic regression model. It gives a simultaneous representation of the odds of being in one category relative to being in another category, for all pairs of categories. With a set of J - 1 non-redundant odds we can figure out the odds for any pair of categories.

Suppose we have data that identify respondents' political affiliation as democrat, republican, or independent, and we want to know whether political affiliation can be predicted by SES, which is a quantitative (i.e., continuous) variable. For these data the response variable is party identification. We could fit a binary logit model to each pair of party identifications:

  \log\left(\frac{\pi_1(x)}{\pi_2(x)}\right) = \alpha_1 + \beta_1 x    (democrat vs. republican)

  \log\left(\frac{\pi_2(x)}{\pi_3(x)}\right) = \alpha_2 + \beta_2 x    (republican vs. independent)

  \log\left(\frac{\pi_1(x)}{\pi_3(x)}\right) = \alpha_3 + \beta_3 x    (democrat vs. independent)

We can write one of the odds in terms of the other two:

  \frac{\pi_1(x)}{\pi_3(x)} = \frac{\pi_1(x)}{\pi_2(x)} \times \frac{\pi_2(x)}{\pi_3(x)}

Therefore, we can find the model parameters of one logit from the other two:

  \log\left(\frac{\pi_1(x)}{\pi_2(x)}\right) + \log\left(\frac{\pi_2(x)}{\pi_3(x)}\right) = \log\left(\frac{\pi_1(x)}{\pi_3(x)}\right)

  (\alpha_1 + \beta_1 x) + (\alpha_2 + \beta_2 x) = \alpha_3 + \beta_3 x

which means that in the population

  \alpha_1 + \alpha_2 = \alpha_3   and   \beta_1 + \beta_2 = \beta_3.

With sample data, the estimates from separate binary logit models are consistent estimators of the parameters of the model, but estimates from fitting separate binary logit models will NOT satisfy the equalities that hold in the population. In other words,

  \hat{\alpha}_1 + \hat{\alpha}_2 \ne \hat{\alpha}_3   and   \hat{\beta}_1 + \hat{\beta}_2 \ne \hat{\beta}_3.

We can solve this problem by simultaneously estimating the parameters of the model. This will enforce the logical relationships among the parameters and will use the data more efficiently, resulting in smaller standard errors.
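To make this concrete, here is a minimal Python sketch (not part of the original notes, which use SAS) that simulates party affiliation from SES and compares separately fitted binary logits with a simultaneous baseline-category fit. The data, the variable names ses and party, and the "true" parameter values are all made up for illustration; statsmodels is assumed to be available.

```python
# Sketch: separate binary logits vs. a simultaneous baseline-category fit.
# All data are simulated; parameter values are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
ses = rng.uniform(0, 100, n)

# Generate party (0 = independent baseline, 1 = democrat, 2 = republican)
# from a baseline-category logit model with assumed parameters.
eta_dem = 0.15 - 0.0001 * ses
eta_rep = -1.0 + 0.02 * ses
denom = 1 + np.exp(eta_dem) + np.exp(eta_rep)
p = np.column_stack([1 / denom, np.exp(eta_dem) / denom, np.exp(eta_rep) / denom])
party = np.array([rng.choice(3, p=pi) for pi in p])

X = sm.add_constant(ses)

def binary_logit(cat_a, cat_b):
    """Fit log[pi_a(x)/pi_b(x)] = alpha + beta*x using only cases in a or b."""
    keep = np.isin(party, [cat_a, cat_b])
    return sm.Logit((party[keep] == cat_a).astype(int), X[keep]).fit(disp=0).params

b_dem_rep = binary_logit(1, 2)   # democrat vs republican
b_rep_ind = binary_logit(2, 0)   # republican vs independent
b_dem_ind = binary_logit(1, 0)   # democrat vs independent

# Separate fits: (dem/rep) + (rep/ind) is close to, but not equal to, (dem/ind).
print(b_dem_rep + b_rep_ind, b_dem_ind)

# Simultaneous fit: MNLogit treats the lowest category (0 = independent) as the
# baseline, so the dem/rep contrast is exactly the difference of the two fitted logits.
mn = sm.MNLogit(party, X).fit(disp=0)
print(mn.params)
```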
With the baseline category logit model we choose one of the categories as the "baseline". This choice may be arbitrary, or there may be a logical choice depending on the data. For convenience, we will use the last level (i.e., the Jth level) of the response variable as the baseline. The baseline category logit model with one explanatory variable, x, is:

  \log\left(\frac{\pi_{ij}}{\pi_{iJ}}\right) = \alpha_j + \beta_j x_i   for j = 1, 2, ..., J - 1.

For J = 2 this is just the regular binary logistic regression model. For J > 2, \alpha_j and \beta_j can differ depending on which two categories are being compared. The odds for any pair of categories of Y are a function of the parameters of the model.

Using the previous data, with SES as the only explanatory variable, we have 3 - 1 = 2 non-redundant logits:

  \log\left(\frac{\pi_{\text{democrat}}}{\pi_{\text{independent}}}\right) = \log\left(\frac{\pi_1}{\pi_3}\right) = \alpha_1 + \beta_1 x

  \log\left(\frac{\pi_{\text{republican}}}{\pi_{\text{independent}}}\right) = \log\left(\frac{\pi_2}{\pi_3}\right) = \alpha_2 + \beta_2 x

The logit for democrat versus republican is:

  \log\left(\frac{\pi_{\text{democrat}}}{\pi_{\text{republican}}}\right) = \log\left(\frac{\pi_1}{\pi_3}\right) - \log\left(\frac{\pi_2}{\pi_3}\right) = (\alpha_1 + \beta_1 x) - (\alpha_2 + \beta_2 x) = (\alpha_1 - \alpha_2) + (\beta_1 - \beta_2) x

A difference such as \beta_1 - \beta_2 is called a contrast.

CAUTION: You MUST be certain what the computer program you use to estimate the model is doing. Some programs set the parameters of the first category to 0, some set the parameters of the Jth category to 0, and others constrain the parameters to sum to 0.

Fitting the model with SES predicting party affiliation, I obtained:

  \log\left(\frac{\hat{\pi}_{\text{democrat}}}{\hat{\pi}_{\text{independent}}}\right) = 0.1502 - 0.00013 x

  \log\left(\frac{\hat{\pi}_{\text{republican}}}{\hat{\pi}_{\text{independent}}}\right) = -0.9987 + 0.0191 x

and therefore

  \log\left(\frac{\hat{\pi}_{\text{democrat}}}{\hat{\pi}_{\text{republican}}}\right) = (0.1502 - 0.00013x) - (-0.9987 + 0.0191x) = 1.1489 - 0.01923 x

We can interpret the parameters of the model in terms of odds ratios for a given increase in SES. For a 10-point increase in the SES index we obtain the following odds ratios:

  Democrat to Independent = exp(10(-0.00013)) = 0.999
  Republican to Independent = exp(10(0.0191)) = 1.210
  Democrat to Republican = exp(10(-0.01923)) = 0.825
  Republican to Democrat = 1/0.825 = 1.212

Just as in binary logistic regression, we can also interpret the parameters of the model in terms of probabilities. The probability of a response being in category j is

  \pi_j = \frac{\exp(\alpha_j + \beta_j x)}{\sum_{k=1}^{J} \exp(\alpha_k + \beta_k x)}

Note that for the baseline category (independent in our case), \alpha_J = \beta_J = 0. This is an identification constraint. Furthermore, the denominator, \sum_{k=1}^{J} \exp(\alpha_k + \beta_k x), ensures that the probabilities sum to 1. Using our estimated parameters we obtain:

  \hat{\pi}_{\text{democrat}} = \frac{\exp(0.1502 - 0.00013x)}{1 + \exp(0.1502 - 0.00013x) + \exp(-0.9987 + 0.0191x)}

  \hat{\pi}_{\text{republican}} = \frac{\exp(-0.9987 + 0.0191x)}{1 + \exp(0.1502 - 0.00013x) + \exp(-0.9987 + 0.0191x)}

  \hat{\pi}_{\text{independent}} = \frac{1}{1 + \exp(0.1502 - 0.00013x) + \exp(-0.9987 + 0.0191x)}

We can use these functions to plot the probabilities versus SES.

[Figure: estimated probabilities of Democrat, Republican, and Independent versus SES (0 to 100).]

We can easily add more explanatory variables to our model, and these variables can be either categorical or numeric. We identify numeric variables in proc catmod with the "direct" statement. Furthermore, all of the model comparison methods that we have used in the past will work with this model as well.

The baseline category logit model can be used when the categories of the response variable are ordered, but it may not be the best model for ordinal responses.
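As a rough check on the calculations above, the following Python sketch carries the estimates reported in these notes into numpy (an illustration only; it is not the SAS code actually used). It evaluates the fitted baseline-category probabilities over an assumed SES range of 0 to 100 and reproduces the odds ratios for a 10-point increase in SES.

```python
# Sketch: evaluate the fitted baseline-category probabilities and odds ratios.
import numpy as np

# Parameter estimates taken from the notes; independent is the baseline (0, 0).
alpha = {"democrat": 0.1502, "republican": -0.9987}
beta = {"democrat": -0.00013, "republican": 0.0191}

ses = np.linspace(0, 100, 101)                      # assumed SES range
num_dem = np.exp(alpha["democrat"] + beta["democrat"] * ses)
num_rep = np.exp(alpha["republican"] + beta["republican"] * ses)
denom = 1 + num_dem + num_rep                       # baseline term is exp(0) = 1

p_dem, p_rep, p_ind = num_dem / denom, num_rep / denom, 1 / denom
# p_dem + p_rep + p_ind equals 1 at every SES value.

# Odds ratios for a 10-point increase in SES, as in the text.
print(np.exp(10 * beta["democrat"]))                         # democrat vs independent ~ 0.999
print(np.exp(10 * beta["republican"]))                       # republican vs independent ~ 1.21
print(np.exp(10 * (beta["democrat"] - beta["republican"])))  # democrat vs republican ~ 0.825
```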
Proportional Odds Model for Ordinal Response Variables

When the response variable is ordered we can use the ordering of the categories in forming the logits. Using the ordering results in a more powerful model than the baseline category logit model. It also yields a simpler model with simpler interpretations. We will only consider one of these models, the proportional odds model.

For this model the effect of the explanatory variable(s) is the same regardless of how we collapse Y into dichotomous categories. Therefore, a single parameter describes the effect of x on Y, versus the J - 1 parameters that are needed in the baseline category model. However, the intercepts can differ.

For this model we use cumulative probabilities, which are the probabilities that Y falls in category j or below. In other words,

  P(Y \le j) = \pi_1 + \pi_2 + \cdots + \pi_j   for j = 1, 2, ..., J.

Cumulative probabilities reflect the ordering of the categories and are used to form cumulative logits. A cumulative logit has the form:

  \log\left(\frac{P(Y \le j)}{P(Y > j)}\right) = \log\left(\frac{P(Y \le j)}{1 - P(Y \le j)}\right) = \log\left(\frac{\pi_1 + \pi_2 + \cdots + \pi_j}{\pi_{j+1} + \cdots + \pi_J}\right)

Models that use cumulative probabilities do not use the final cumulative probability, P(Y \le J), since it must equal 1.

A model for the jth cumulative logit looks like an ordinary logit model for a dichotomous response variable in which categories 1 to j combine to form a single category. In other words, the response variable collapses into two categories, one for categories 1 through j and one for categories j + 1 through J. The proportional odds model has the form:

  \log\left(\frac{P(Y \le j)}{1 - P(Y \le j)}\right) = \alpha_j + \beta x

Cumulative probabilities are given by:

  P(Y \le j) = \frac{\exp(\alpha_j + \beta x)}{1 + \exp(\alpha_j + \beta x)}

We can compute the probability of being in category j by taking differences between cumulative probabilities. In other words,

  P(Y = j) = P(Y \le j) - P(Y \le j - 1)   for j = 2, ..., J
  P(Y = 1) = P(Y \le 1)

Therefore, this model is sometimes referred to as a difference model.

To interpret this model in terms of odds ratios for a given level of Y, say Y = j, compare the cumulative odds at two values of x:

  \frac{P(Y \le j \mid X = x_2)/P(Y > j \mid X = x_2)}{P(Y \le j \mid X = x_1)/P(Y > j \mid X = x_1)} = \frac{\exp(\alpha_j + \beta x_2)}{\exp(\alpha_j + \beta x_1)} = \exp[\beta(x_2 - x_1)]

The log of the odds ratio is proportional to the difference between x_2 and x_1, and since the constant of proportionality, \beta, is the same for every category j, this model is called the proportional odds model.

We can fit this model using either proc logistic or proc catmod. When the number of categories is greater than 2, proc logistic fits a proportional odds model using maximum likelihood estimation. To use proc catmod you need to specify that the response to be modeled is clogits, which is an abbreviation for cumulative logits. Proc catmod uses weighted least squares estimation to fit the proportional odds model. For large samples with categorical explanatory variables the results are almost the same. In general, maximum likelihood estimation is preferred with quantitative explanatory variables.

I fit a proportional odds model predicting how much one liked big band music from age and obtained the following estimates:

  intercept, "like it very much" = -3.2566
  intercept, "like it" = -1.2391
  intercept, "mixed feelings" = -0.1981
  intercept, "dislike it" = 1.6670
  age = 0.0361

Interpreting this in terms of odds ratios, for ages 30 versus 50 the cumulative odds ratio is exp[0.0361(30 - 50)] = exp(-0.722) = 0.4857.

Interpreting this in terms of cumulative probabilities:

  P(Y \le 1) = \frac{\exp(-3.2566 + 0.0361x)}{1 + \exp(-3.2566 + 0.0361x)}

  P(Y \le 2) = \frac{\exp(-1.2391 + 0.0361x)}{1 + \exp(-1.2391 + 0.0361x)}

  etc.

[Figure: estimated cumulative probabilities P(Y ≤ 1) through P(Y ≤ 4) versus age (10 to 90).]

Calculating the category probabilities we get:

  P(Y = 1) = \frac{\exp(-3.2566 + 0.0361x)}{1 + \exp(-3.2566 + 0.0361x)}

  P(Y = 2) = P(Y \le 2) - P(Y \le 1) = \frac{\exp(-1.2391 + 0.0361x)}{1 + \exp(-1.2391 + 0.0361x)} - \frac{\exp(-3.2566 + 0.0361x)}{1 + \exp(-3.2566 + 0.0361x)}

  etc.

[Figure: estimated category probabilities P(Y = 1) through P(Y = 5) versus age (10 to 90).]
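The following Python sketch (again an illustration in numpy rather than the SAS procedures used in the notes) evaluates the fitted proportional odds model: the cumulative probabilities P(Y ≤ j), the category probabilities P(Y = j) obtained by differencing them, and the age 30 versus age 50 cumulative odds ratio.

```python
# Sketch: evaluate the fitted proportional odds model for big band music and age.
import numpy as np

alphas = np.array([-3.2566, -1.2391, -0.1981, 1.6670])  # intercepts for j = 1..4, from the notes
beta = 0.0361                                            # age effect, from the notes

def cumulative_probs(age):
    """P(Y <= j | age) for j = 1, ..., 4, with P(Y <= 5) = 1 appended."""
    logits = alphas + beta * age
    return np.append(np.exp(logits) / (1 + np.exp(logits)), 1.0)

def category_probs(age):
    """P(Y = j | age) as successive differences of the cumulative probabilities."""
    return np.diff(cumulative_probs(age), prepend=0.0)

print(cumulative_probs(30))      # cumulative curve at age 30
print(category_probs(30))        # category probabilities; they sum to 1
print(np.exp(beta * (30 - 50)))  # cumulative odds ratio, age 30 vs 50 ~ 0.486
```

Differencing the cumulative curves in this way is exactly the "difference model" calculation described above, so the same two functions reproduce both figures.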