Multinomial Logit & Ordered Probit

Multinomial Logit

• Used when the outcome categories cannot be ordered. An example is choice of holiday: (i) beach, (ii) mountain, (iii) culture. Each individual goes on just one holiday.
• We will examine this within the context of insurance data. The exact meaning does not matter; just treat it like the holiday data. For a clue about the variables, load the data and type:

    use http://www.stata-press.com/data/r11/sysdsn1.dta, clear
    describe
    summ *ins*
    label list insure

    . tab2 insure insure

    -> tabulation of insure by insure

               |              insure
        insure | Indemnity    Prepaid   Uninsure |     Total
    -----------+---------------------------------+----------
     Indemnity |       294          0          0 |       294
       Prepaid |         0        277          0 |       277
      Uninsure |         0          0         45 |        45
    -----------+---------------------------------+----------
         Total |       294        277         45 |       616

There are 3 options: those who prepay, those who are not insured and those who are covered by an indemnity.

    generate site1=site==1
    generate site2=site==2
    generate site3=site==3

NOW TYPE:

    mlogit insure age male nonwhite site2 site3

    Multinomial logistic regression                   Number of obs   =        615
                                                      LR chi2(10)     =      42.99
                                                      Prob > chi2     =     0.0000
    Log likelihood = -534.36165                       Pseudo R2       =     0.0387

    ------------------------------------------------------------------------------
          insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    Prepaid      |
             age |   -.011745   .0061946    -1.90   0.058    -.0238862    .0003962
            male |   .5616934   .2027465     2.77   0.006     .1643175    .9590693
        nonwhite |   .9747768   .2363213     4.12   0.000     .5115955    1.437958
           site2 |   .1130359   .2101903     0.54   0.591    -.2989296    .5250013
           site3 |  -.5879879   .2279351    -2.58   0.010    -1.034733   -.1412433
           _cons |   .2697127   .3284422     0.82   0.412    -.3740222    .9134476
    -------------+----------------------------------------------------------------
    Uninsure     |
             age |  -.0077961   .0114418    -0.68   0.496    -.0302217    .0146294
            male |   .4518496   .3674867     1.23   0.219     -.268411     1.17211
        nonwhite |   .2170589   .4256361     0.51   0.610    -.6171725     1.05129
           site2 |  -1.211563   .4705127    -2.57   0.010    -2.133751   -.2893747
           site3 |  -.2078123   .3662926    -0.57   0.570    -.9257327     .510108
           _cons |  -1.286943   .5923219    -2.17   0.030    -2.447872   -.1260135
    ------------------------------------------------------------------------------
    (insure==Indemnity is the base outcome)

Note that there are two equations: one to explain those who opt for ‘prepaid’ and a second for those who opt for ‘uninsure’.

• But there are three choices, so why two equations?
Well, the probabilities of the three choices sum to one, so if you know the determinants of two of the choices the third follows by default.
• It can also be viewed as the default choice against which the other two are being compared.
• Here the default case is the first, indemnity. Could we change it? YES.
• mlogit insure age male nonwhite site2 site3, base(2)

This will change the default case to the second option.

    Iteration 0:   log likelihood = -555.85446
    Iteration 1:   log likelihood = -534.72983
    Iteration 2:   log likelihood = -534.36536
    Iteration 3:   log likelihood = -534.36165
    Iteration 4:   log likelihood = -534.36165

    Multinomial logistic regression                   Number of obs   =        615
                                                      LR chi2(10)     =      42.99
                                                      Prob > chi2     =     0.0000
    Log likelihood = -534.36165                       Pseudo R2       =     0.0387

    ------------------------------------------------------------------------------
          insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    Indemnity    |
             age |    .011745   .0061946     1.90   0.058    -.0003962    .0238862
            male |  -.5616934   .2027465    -2.77   0.006    -.9590693   -.1643175
        nonwhite |  -.9747768   .2363213    -4.12   0.000    -1.437958   -.5115955
           site2 |  -.1130359   .2101903    -0.54   0.591    -.5250013    .2989296
           site3 |   .5879879   .2279351     2.58   0.010     .1412433    1.034733
           _cons |  -.2697127   .3284422    -0.82   0.412    -.9134476    .3740222
    -------------+----------------------------------------------------------------
    Uninsure     |
             age |   .0039489   .0115994     0.34   0.734    -.0187855    .0266832
            male |  -.1098438   .3651883    -0.30   0.764    -.8255998    .6059122
        nonwhite |  -.7577178   .4195759    -1.81   0.071    -1.580071    .0646357
           site2 |  -1.324599   .4697954    -2.82   0.005    -2.245381   -.4038165
           site3 |   .3801756   .3728188     1.02   0.308    -.3505358    1.110887
           _cons |  -1.556656   .5963286    -2.61   0.009    -2.725438    -.387873
    ------------------------------------------------------------------------------
    (insure==Prepaid is the base outcome)

The Indemnity equation is simply the earlier Prepaid equation with every sign reversed: the comparison is the same, only the reference category has changed.

The data also come from:
• use http://www.stata-press.com/data/r11/sysdsn1.dta
• mlogit insure age male nonwhite

Clear, set memory and load data

    clear
    set mem 100000
    use "http://staff.bath.ac.uk/hssjrh/oprob.dta"

(set mem is only needed in Stata 11 and earlier; later versions manage memory automatically.)

Describe pers

    . describe pers

                  storage   display    value
    variable name   type    format     label      variable label
    -------------------------------------------------------------------------------
    persit5yr       double  %10.0g     qa5        QA5 PERSONAL SITUATION - FIVE YEARS AGO

• The variable relates to a person’s situation and how it has changed over the last five years.
• Let us look at it.
• Type: tab2 pers pers

The most common response was ‘improved’, but for over half of the sample this was not the case.

          QA5 PERSONAL |
           SITUATION - |      QA5 PERSONAL SITUATION - FIVE YEARS AGO
        FIVE YEARS AGO |  Improved  Stayed ab  Got worse         DK |     Total
    -------------------+--------------------------------------------+----------
              Improved |    11,178          0          0          0 |    11,178
    Stayed about the same |      0      9,533          0          0 |     9,533
             Got worse |         0          0      8,418          0 |     8,418
                    DK |         0          0          0        301 |       301
    -------------------+--------------------------------------------+----------
                 Total |    11,178      9,533      8,418        301 |    29,430

Ordered probit

• We use this when we have discrete data and when it is ordered. In this case
• 1 best (improved)
• 2 next best (stayed about the same)
• 3 worst (got worse).

The ordering is clear.

Change in personal situation

• Assume an underlying, continuous variable relating to changes in the individual’s personal situation.
• If this underlying variable is to the left of μ1 we classify the variable as ‘1’: the individual’s position has improved.
• If this underlying variable is to the right of μ2 we classify the variable as ‘3’: the individual’s position has got worse.
• In between these two values we classify the variable as ‘2’: the individual’s position has stayed the same.
• You might say: surely ‘stay the same’ is one specific value (perhaps 0), with anything to the left of this having improved and anything to the right having got worse.
• But it is common to assume a range of values which denote too small a change to count as either ‘improved’ or ‘got worse’, and the limits of this range are μ1 and μ2.

Do the estimation.

• Simply use oprobit rather than regress.
    oprobit persi lgnipc male age agesq rlaw estonia village town selfemp marrd educ2 unemp manual if age<98 & age>17 & persi<4

This regresses persi on a set of right-hand-side variables. (We do not have to write the full name persit5yr, as this is the only variable in the data set beginning with persi.)

    if age<98 & age>17 & persi<4

This limits the regression to individuals older than 17 and under 98, and also cuts out those who answered ‘don’t know’ (coded 4) for persi.

The results

    Ordered probit regression                         Number of obs   =      25751
                                                      LR chi2(13)     =    4392.02
                                                      Prob > chi2     =     0.0000
    Log likelihood = -25990.573                       Pseudo R2       =     0.0779

    ------------------------------------------------------------------------------
       persit5yr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          lgnipc |  -.0766027   .0209432    -3.66   0.000    -.1176506   -.0355548
            male |  -.0249916   .0147207    -1.70   0.090    -.0538437    .0038604
             age |   .0513208   .0025145    20.41   0.000     .0463924    .0562492
           agesq |  -.0322755   .0025142   -12.84   0.000    -.0372033   -.0273478
            rlaw |  -.2455444    .011504   -21.34   0.000    -.2680919    -.222997
         estonia |   -.869435   .0417246   -20.84   0.000    -.9512136   -.7876564
         village |   .0524684   .0184945     2.84   0.005     .0162199     .088717
            town |   .0338535   .0182975     1.85   0.064     -.002009    .0697159
         selfemp |  -.0974318   .0293755    -3.32   0.001    -.1550067   -.0398569
           marrd |  -.1534421    .015842    -9.69   0.000    -.1844918   -.1223923
           educ2 |  -.1429612   .0090999   -15.71   0.000    -.1607966   -.1251257
           unemp |   .6080104   .0313593    19.39   0.000     .5465472    .6694736
          manual |   .0821374   .0193907     4.24   0.000     .0441323    .1201426
    -------------+----------------------------------------------------------------
           /cut1 |  -.6563796    .078922                     -.8110638   -.5016953
           /cut2 |   .3095595   .0788323                       .155051     .464068
    ------------------------------------------------------------------------------

The summary output shows the number of observations, the log likelihood and the likelihood ratio chi-squared statistic. A pseudo R2 is exactly that, and we may cover it in the lectures later. It is rarely very high in ordered probit.

Remember that the lower the dependent variable (persi...), the better the person has done (1 for improved, 3 for got worse). So a negative coefficient indicates that as that variable increases, the person tends to have been doing better.

• lgnipc, rlaw, estonia, selfemp: the self-employed have been doing better as, perhaps surprisingly, have people in Estonia. Those in countries with a good rule of law have done better, and those in richer countries too (lgnipc: log gross national income per capita).
• marrd, educ2, unemp, manual: married people and educated people have been doing better, but the unemployed and manual workers worse.

Impact of age

             age |   .0513208   .0025145    20.41   0.000     .0463924    .0562492
           agesq |  -.0322755   .0025142   -12.84   0.000    -.0372033   -.0273478

The impact of age is thus 0.0513*AGE - 0.0322*AGE*AGE/100. The division by 100 is there because this is how age squared was calculated (agesq = AGE*AGE/100). So the impact is:

    AGE       25       40       55       70
    IMPACT    1.0812   1.5368   1.8474   2.0132

For example, at age 25: 0.0513*25 - 0.0322*625/100 = 1.2825 - 0.2013 = 1.0812.

As people get older the probability of things getting worse increases. WHY?

And finally

           /cut1 |  -.6563796    .078922                     -.8110638   -.5016953
           /cut2 |   .3095595   .0788323                       .155051     .464068

These are the estimates of μ1 and μ2.

• If for an individual the predicted value from the regression is less than -0.6564 then they would be predicted to be categorised as ‘1’: position improved.
• If for an individual the predicted value from the regression is greater than 0.3096 then they would be predicted to be categorised as ‘3’: position has got worse.
• And if the predicted value lies between these two values, then the predicted category is ‘2’: no change.

Let us calculate some examples. First do the regression and store the coefficient vector as cy:

    oprobit persi lgnipc male age agesq rlaw estonia village town selfemp marrd educ2 unemp manual if age<98 & age>17 & persi<4
    matrix cy = e(b)

• Then calculate (in a do-file; at the command prompt type it on one line without the ///):

    scalar py50 = cy[1,1]*3.0 + cy[1,2]*1 + cy[1,3]*50 + cy[1,4]*50*50/100 ///
                + cy[1,5]*5 + cy[1,6]*0 + cy[1,7]*1 + cy[1,8]*0 + cy[1,9]*0 ///
                + cy[1,10]*1 + cy[1,11]*4 + cy[1,12]*0 + cy[1,13]*0

• cy[1,1] is the coefficient on lgnipc. The average value for this is 3.0.
• cy[1,2] is the coefficient on male. Let us code this as 1, i.e. we are predicting for a man.
• The other characteristics are: 50 years old (so agesq = 50*50/100), country with the highest level of rule of law (5), not in Estonia, living in a village, not self-employed, married, educ2 = 4, not unemployed and not a manual worker.

    . display py50
    -.39618871

This lies between -0.6564 and 0.3096, the two critical values, and hence this person would be predicted to be ‘no change’.

Now let us try the same person, but aged 30.

    scalar py30 = cy[1,1]*3.0 + cy[1,2]*1 + cy[1,3]*30 + cy[1,4]*30*30/100 ///
                + cy[1,5]*5 + cy[1,6]*0 + cy[1,7]*1 + cy[1,8]*0 + cy[1,9]*0 ///
                + cy[1,10]*1 + cy[1,11]*4 + cy[1,12]*0 + cy[1,13]*0

    . display py30
    -.90619611

This is less than the lower critical value of -0.6564, hence this person would be predicted to have improved.
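The classification above looks only at where the predicted index falls relative to the two cut points. As a rough sketch, the implied probability of each category can also be computed from the numbers already displayed, using Stata's normal() function (the standard normal distribution function). The scalar names cut1, cut2, xb, p1, p2 and p3 are introduced here purely for illustration and are not part of the original slides. The formulas assume Stata's oprobit parameterisation, in which P(y = 1) is the normal distribution function evaluated at cut1 minus the index.

```stata
* Values copied from the oprobit output and from "display py50" above
scalar cut1 = -.6563796
scalar cut2 =  .3095595
scalar xb   = -.39618871      // index for the 50-year-old man

* P(y = j) = Phi(cut_j - xb) - Phi(cut_(j-1) - xb)
scalar p1 = normal(cut1 - xb)                       // improved
scalar p2 = normal(cut2 - xb) - normal(cut1 - xb)   // stayed about the same
scalar p3 = 1 - normal(cut2 - xb)                   // got worse

display "P(improved)  = " p1
display "P(no change) = " p2
display "P(got worse) = " p3
display "Sum          = " p1 + p2 + p3
```

Replacing the value of xb with -.90619611 gives the corresponding probabilities for the 30-year-old. The three probabilities give more information than the single predicted category, which is worth bearing in mind when the index lies close to a cut point.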
• No one has ever analysed this before and there may be a paper in it.
• That people’s situation gets worse as they age is not surprising, once they reach say 50. But these results suggest it is so not only for those aged 60 vis-à-vis 50 but also for those aged 30 vis-à-vis 20.
• Perhaps we should try a spline on this, just to check that the quadratic form on age is not misleading.
• And why do educated people fare better?

Ordered Logit ‘by hand’

    program myologit
        args lnf xb a1 a2
        * The contribution to the likelihood at each level of y
        quietly replace `lnf' = ln(1/(1+exp(-`a1' + `xb'))) if $ML_y1 == 1
        quietly replace `lnf' = ln(1/(1+exp(-`a2' + `xb')) - 1/(1+exp(-`a1' + `xb'))) if $ML_y1 == 2
        quietly replace `lnf' = ln(1 - 1/(1+exp(-`a2' + `xb'))) if $ML_y1 == 3
    end

    * specify the method (lf) and the name of your evaluator (myologit)
    * followed by the equation(s) in parentheses and then the cutpoints.
    ml model lf myologit (xb: insure = age male nonwhite ) /a1 /a2
    ml check
    ml search
    ml maximize, iterate(50)

    convergence not achieved

                                                      Number of obs   =        615
                                                      Wald chi2(3)    =      15.91
    Log likelihood = -547.75513                       Prob > chi2     =     0.0012

    ------------------------------------------------------------------------------
          insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    xb           |
             age |  -.0087368   .0055974    -1.56   0.119    -.0197076    .0022339
            male |   .5056461   .1826912     2.77   0.006      .147578    .8637142
        nonwhite |   .5615129   .1958493     2.87   0.004     .1776553    .9453705
           _cons |   3.866424   .2924209    13.22   0.000      3.29329    4.439558
    -------------+----------------------------------------------------------------
    a1           |
           _cons |   3.622825   .1567336    23.11   0.000     3.315632    3.930017
    -------------+----------------------------------------------------------------
    a2           |
           _cons |    6.30395          .        .       .            .           .
    ------------------------------------------------------------------------------
    Warning: convergence not achieved

This does not converge, and there is no standard error for the second cut-off point. The xb equation contains a constant, and a constant cannot be separated from the two cut points: only the differences are pinned down. Here a1 - _cons = 3.622825 - 3.866424 = -0.2436 and a2 - _cons = 6.30395 - 3.866424 = 2.4375, which are exactly /cut1 and /cut2 in the ologit output below. The slope coefficients and the log likelihood are the same as if we use the ologit command:

    . ologit insure age male nonwhite

    Iteration 0:   log likelihood = -555.85446
    Iteration 1:   log likelihood = -547.76723
    Iteration 2:   log likelihood = -547.75513
    Iteration 3:   log likelihood = -547.75513

    Ordered logistic regression                       Number of obs   =        615
                                                      LR chi2(3)      =      16.20
                                                      Prob > chi2     =     0.0010
    Log likelihood = -547.75513                       Pseudo R2       =     0.0146

    ------------------------------------------------------------------------------
          insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |  -.0087368   .0055974    -1.56   0.119    -.0197076    .0022339
            male |   .5056461   .1826912     2.77   0.006      .147578    .8637142
        nonwhite |   .5615129   .1958493     2.87   0.004     .1776553    .9453705
    -------------+----------------------------------------------------------------
           /cut1 |  -.2435994   .2619071                     -.7569278    .2697289
           /cut2 |   2.437526   .2924209                      1.864391     3.01066
    ------------------------------------------------------------------------------

See also: http://www.ats.ucla.edu/stat/stata/code/ml_maximize.htm

Commands used in this section (with program myologit defined as above):

    use http://www.stata-press.com/data/r11/sysdsn1.dta
    mlogit insure age male nonwhite
    ologit insure age male nonwhite
    oprobit insure age male nonwhite
    ml model lf myologit (xb: insure = age male nonwhite ) /a1 /a2
    ml check
    ml search
    ml maximize

The ordered logit counterpart of the earlier ordered probit (this one needs the oprob.dta data, not the insurance data):

    ologit persi lgnipc male age agesq rlaw estonia village town selfemp marrd educ2 unemp manual if age<98 & age>17 & persi<4
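The hand-written evaluator fails to converge most likely because the xb equation carries its own constant alongside the two cut points, so one of the three is redundant (in the output above, a1 and a2 minus the xb constant reproduce /cut1 and /cut2). A minimal sketch of one way round this, assuming the same data and evaluator, is to drop the constant from the xb equation with the noconstant option of the ml model equation syntax. This is a suggested variation and not something the original slides ran, so the output is not shown.

```stata
* Sketch only: same evaluator as above, but xb has no constant,
* so a1 and a2 play the role of /cut1 and /cut2 directly.
capture program drop myologit
program myologit
    args lnf xb a1 a2
    quietly replace `lnf' = ln(1/(1+exp(-`a1' + `xb'))) if $ML_y1 == 1
    quietly replace `lnf' = ln(1/(1+exp(-`a2' + `xb')) - 1/(1+exp(-`a1' + `xb'))) if $ML_y1 == 2
    quietly replace `lnf' = ln(1 - 1/(1+exp(-`a2' + `xb'))) if $ML_y1 == 3
end

use http://www.stata-press.com/data/r11/sysdsn1.dta, clear

* noconstant removes the redundant intercept from the xb equation
ml model lf myologit (xb: insure = age male nonwhite, noconstant) /a1 /a2
ml check
ml search
ml maximize

* For comparison: a1 and a2 should line up with /cut1 and /cut2 here
ologit insure age male nonwhite
```

If the estimates of a1 and a2 come out close to -0.2436 and 2.4375, with standard errors for both, that would confirm the constant was the source of the problem.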