CITY UNIVERSITY OF HONG KONG

Course code & title : MS4225: Business Research Modelling
Session             : Semester B, 2007-2008
Time allowed        : 2 hours

This paper has 12 pages (including this page)

Instructions to candidates:
1. Answer ALL THREE questions
2. Show sufficient work for each question
3. This question paper is NOT to be taken away

Materials, aids and instruments permitted during examination:
Approved calculator


Question 1 (35 marks)

Table 1.1 classifies 30,292 Democratic voters of the 2008 U.S. Presidential Primary in the state of Texas according to gender (male, female), ethnic origin (White, Hispanic, African) and their choice of Democratic nominee (Clinton, Obama).

Table 1.1

                       Male                            Female
            White    Hispanic   African     White    Hispanic   African
Clinton     3,343    1,615      1,104       4,517    2,990      1,913
Obama       3,615    1,017      4,012       2,012      540      3,614

Let GENDER = 0 for female, 1 for male; RACE = 1 for White, 2 for Hispanic and 3 for African. Our goal is to estimate a logit model for the dependence of the choice of Democratic nominee on gender and race. To do this, we run the SAS program:

    DATA QUESTION1;
    INPUT GENDER RACE CLINTON OBAMA;
    TOTAL = CLINTON + OBAMA;
    DATALINES;
    1 1 3343 3615
    1 2 1615 1017
    1 3 1104 4012
    0 1 4517 2012
    0 2 2990 540
    0 3 1913 3614
    ;
    PROC GENMOD DATA=QUESTION1;
    CLASS RACE;
    MODEL CLINTON/TOTAL=GENDER RACE/D=B COVB;
    RUN;

i) Explain the purpose of the statement “MODEL CLINTON/TOTAL=GENDER RACE/D=B;” in the above SAS program. (3 marks)

ii) Explain the purpose of the “CLASS RACE;” statement in the above SAS program. (2 marks)

iii) What is the advantage of putting RACE in the CLASS statement and what will happen if this statement is removed? (2 marks)

iv) Appendix 1 gives the parameter estimates and associated output. What are RACE 1, RACE 2 and RACE 3? Calculate and interpret the odds ratio estimates for GENDER, RACE 1 and RACE 2. (6 marks)

v) Comment on the “overall” quality of the estimated model in terms of the Deviance test. Why is the number of degrees of freedom for the test equal to 2? (4 marks)

vi) Using the estimated model, calculate the probabilities of a) a Hispanic female and b) a Black male choosing Clinton as the Democratic nominee. (6 marks)

vii) Suppose the predicted probabilities of a White female and a Black female voting for Obama are 0.309572 and 0.622819 respectively, and the predicted probabilities of a White male and a Hispanic male voting for Clinton are 0.4817759 and 0.6522784 respectively. Use this information and the results of part vi) to work out the predicted frequency of each entry in the contingency table. (4 marks)

viii) Using the results in vii), verify the Pearson Chi-Square value of 62.2001 in the SAS output. Provide an interpretation of what this value tells us. (8 marks)


Question 2 (35 marks)

The marketing manager of Biotherm Homme, a major manufacturer of men’s skin-care products, is trying to determine whether or not to advertise in a magazine read mostly by young male professionals. He collected data on the age and occupation of customers purchasing products by Biotherm Homme and its two closest competitors, Clarins and Clinique, over the past three months. The data (in number of customers) are as follows:

Age      Occupational Status              Biotherm Homme   Clarins   Clinique
25-35    Professional                     36               27        29
         White collar non-professional    52               19        26
         Blue collar                       8                5         5
>35      Professional                     24               41        32
         White collar non-professional    22               26        27
         Blue collar                       2                3         3

Let AGE = 0 if 25 ≤ Age ≤ 35 and AGE = 1 if Age > 35; OS = 1 for Blue collar, 2 for White collar non-professional and 3 for Professional; and let Product consumed (Y) be coded 1 for Biotherm Homme, 2 for Clarins and 3 for Clinique. A multinomial logit model with PROC CATMOD has been fitted with Y as the dependent variable and AGE and OS as explanatory variables. The SAS program is shown below and the results are shown in Appendix 2a.

    DATA QUESTION2;
    INPUT AGE OS Y FREQ;
    DATALINES;
    0 3 1 36
    0 3 2 27
    0 3 3 29
    0 2 1 52
    0 2 2 19
    . . .
    1 2 3 27
    1 1 1 2
    1 1 2 3
    1 1 3 3
    ;
    PROC CATMOD DATA=QUESTION2;
    WEIGHT FREQ;
    DIRECT AGE OS;
    MODEL Y = AGE OS/NOITER;
    RUN;

i) Explain the rationale of treating the Y categories as unordered. (2 marks)

ii) Is there evidence of an age effect and of an occupational status effect? Answer this question using information from the ANOVA table of the output. (4 marks)

iii) Write down the estimated equations of the log odds for Biotherm Homme vs. Clarins, Biotherm Homme vs. Clinique, and Clarins vs. Clinique. (7 marks)

iv) Obtain the odds ratio estimate of AGE in the model for Biotherm Homme vs. Clinique. Give an interpretation of this odds ratio estimate. (3 marks)

v) What is the purpose of the DIRECT statement in the SAS program? (2 marks)

vi) Do you agree that Biotherm Homme is a popular product among young professional men? (2 marks)

vii) Discuss the meaning of the assumption of “Independence of Irrelevant Alternatives (IIA)”. (3 marks)

viii) Using results from Appendices 2a and 2b, conduct the Hausman-McFadden test for the IIA assumption. You may find the following information useful:

\[
\begin{pmatrix}
0.1248682 \times 10^{-1} & 0.4930000 \times 10^{-5} & -0.5222100 \times 10^{-2} \\
0.4930000 \times 10^{-5} & 0.3772000 \times 10^{-4} & -0.8540000 \times 10^{-5} \\
-0.5222100 \times 10^{-2} & -0.8540000 \times 10^{-5} & 0.2728538 \times 10^{-2}
\end{pmatrix}^{-1}
=
\begin{pmatrix}
401.7872 & 121.6721 & 769.3541 \\
121.6721 & 26566.78 & 316.0170 \\
769.3541 & 316.0170 & 1839.939
\end{pmatrix}
\]

(12 marks)


Question 3 (30 marks)

A company did a survey of employees between 55 and 65 years of age who were eligible for retirement. The dependent variable (R) was the response to the question of whether the employee would retire in the next twelve months, for which the employee could answer 0 for no, 1 for undecided, and 2 for yes. The explanatory variables were A, the age of the employee; N, the number of years employed; and S, the monthly salary at the present time. The analyst wanted to take into account the ordering of the dependent variable R when estimating the model.

i) Discuss the representation of the Ordered Logit model starting off with an unobserved (latent) variable. (6 marks)

ii) In an Ordered Logit model, would the “marginal effects” always take the same sign as the corresponding coefficients? Why or why not? (6 marks)

iii) The results reported in Appendix 3 have been obtained with SAS. Comment on the “overall quality” of this estimated model based on the Likelihood Ratio test. (5 marks)

iv) What is the purpose of the “Score test for the proportional odds assumption”? Comment on the result of the test. (5 marks)

v) Calculate the probability that an employee with the following characteristics will make no decision as to whether he will retire in the next twelve months: A = 64, N = 36 and S = 65,000. (5 marks)

vi) How will the estimation results differ if the DESCENDING option is used in the PROC LOGISTIC statement? (3 marks)


Appendix 1

The SAS System
The GENMOD Procedure

Model Information

Data Set                       WORK.QUESTION1
Distribution                   Binomial
Link Function                  Logit
Response Variable (Events)     clinton
Response Variable (Trials)     total

Number of Observations Read        6
Number of Observations Used        6
Number of Events               15482
Number of Trials               30292

Class Level Information

Class    Levels    Values
race     3         1 2 3

Criteria For Assessing Goodness Of Fit

Criterion             DF    Value          Value/DF
Deviance              2     62.4794        31.2397
Scaled Deviance       2     62.4794        31.2397
Pearson Chi-Square    2     62.2001        31.1001
Scaled Pearson X2     2     62.2001        31.1001
Log Likelihood              -18380.5253

Algorithm converged.

Analysis Of Parameter Estimates

                                Standard    Wald 95%
Parameter     DF    Estimate    Error       Confidence Limits     ChiSquare    Pr > ChiSq
Intercept     1     -0.5485     0.0240      -0.5956    -0.5015     522.84      <.0001
gender        1     -0.8750     0.0254      -0.9249    -0.8252    1183.82      <.0001
race 1        1      1.3507     0.0286       1.2946     1.4067    2230.07      <.0001
race 2        1      2.0527     0.0372       1.9797     2.1256    3041.31      <.0001
race 3        0      0.0000     0.0000       0.0000     0.0000       .         .
Scale         0      1.0000     0.0000       1.0000     1.0000

NOTE: The scale parameter was held fixed.
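
As a reading aid for the parameter table in Appendix 1 (and not part of the examination paper itself), the following minimal Python sketch shows how a fitted logit is turned into a probability. It assumes the usual GENMOD coding in which the last CLASS level (race 3) is the reference category with coefficient zero, which is consistent with the zero estimate printed for race 3. The helper name prob_clinton is hypothetical. The two cases evaluated are the White male and Hispanic male, whose predicted probabilities are already quoted in Question 1 vii).

```python
import math

# Estimates copied from the "Analysis Of Parameter Estimates" table.
INTERCEPT = -0.5485
GENDER = -0.8750                          # coefficient on GENDER (1 = male)
RACE = {1: 1.3507, 2: 2.0527, 3: 0.0}     # race 3 is the reference level


def prob_clinton(gender: int, race: int) -> float:
    """Inverse logit of the linear predictor (hypothetical helper)."""
    eta = INTERCEPT + GENDER * gender + RACE[race]
    return 1.0 / (1.0 + math.exp(-eta))


white_male = prob_clinton(1, 1)
hispanic_male = prob_clinton(1, 2)
print(round(white_male, 4), round(hispanic_male, 4))
```

Because the table reports the estimates to four decimal places only, the results should agree with the values 0.4817759 and 0.6522784 quoted in Question 1 vii) to about four decimal places rather than exactly.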


Appendix 2a

The CATMOD Procedure

Data Summary

Response             Y            Response Levels       3
Weight Variable      FREQ         Populations           6
Data Set             QUESTION2    Total Frequency     387
Frequency Missing    0            Observations         18

Population Profiles

Sample    AGE    OS    Sample Size
----------------------------------
1         0      1     18
2         0      2     97
3         0      3     92
4         1      1      8
5         1      2     75
6         1      3     97

Response Profiles

Response    Y
-------------
1           1
2           2
3           3

Maximum Likelihood Analysis

Maximum likelihood computations converged.

Maximum Likelihood Analysis of Variance

Source              DF    Chi-Square    Pr > ChiSq
--------------------------------------------------
Intercept           2      7.51         0.0234
AGE                 2     15.48         0.0004
OS                  2      2.44         0.2949
Likelihood Ratio    6      3.11         0.7955

The CATMOD Procedure

Analysis of Maximum Likelihood Estimates

              Function                Standard    Chi-
Parameter     Number      Estimate    Error       Square    Pr > ChiSq
-------------------------------------------------------------------
Intercept     1            0.8745     0.5016      3.04      0.0813
              2           -0.5253     0.5534      0.90      0.3425
AGE           1           -0.7051     0.2543      7.69      0.0056
              2            0.2671     0.2595      1.06      0.3033
OS            1           -0.1728     0.2018      0.73      0.3917
              2            0.1508     0.2158      0.49      0.4846

Covariance Matrix of the Maximum Likelihood Estimates

Row  Parameter      Col1          Col2          Col3          Col4          Col5          Col6
---------------------------------------------------------------------------------------------------
1    Intercept 1    0.25163279    0.13980855    -.01643422    -.00998375    -.09560186    -.05199956
2    Intercept 2    0.13980855    0.30624572    -.01030705    -.02414533    -.05188285    -.11212132
3    AGE 1          -.01643422    -.01030705    0.06467743    0.03303173    -.00455919    -.00265971
4    AGE 2          -.00998375    -.02414533    0.03303173    0.06732694    -.00280056    -.00506059
5    OS 1           -.09560186    -.05188285    -.00455919    -.00280056    0.04071492    0.02190519
6    OS 2           -.05199956    -.11212132    -.00265971    -.00506059    0.02190519    0.04657614


Appendix 2b

The following gives the Maximum Likelihood estimates and the covariance matrix of the estimates after deleting observations for Y = 2:

The CATMOD Procedure

Analysis of Maximum Likelihood Estimates

                          Standard    Chi-
Parameter     Estimate    Error       Square    Pr > ChiSq
----------------------------------------------------------
Intercept      0.8948     0.5139      3.03      0.0817
AGE           -0.7045     0.2544      7.67      0.0056
OS            -0.1814     0.2071      0.77      0.3811

Covariance Matrix of the Maximum Likelihood Estimates

Row  Parameter    Col1          Col2          Col3
------------------------------------------------------------------
1    Intercept    0.26411961    -.01642929    -.10082396
2    AGE          -.01642929    0.06471515    -.00456773
3    OS           -.10082396    -.00456773    0.04290003


Appendix 3

The SAS System
The LOGISTIC Procedure

Model Information

Data Set                     WORK.Q3
Response Variable            R
Number of Response Levels    3
Number of Observations       22
Model                        cumulative logit
Optimization Technique       Fisher's scoring

Response Profile

Ordered Value    R    Total Frequency
1                0    6
2                1    7
3                2    9

Probabilities modeled are cumulated over the lower Ordered Values.

Model Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.

Score Test for the Proportional Odds Assumption

Chi-Square    DF    Pr > ChiSq
1.7989        3     0.6152

The SAS System
The LOGISTIC Procedure

Model Fit Statistics

                                Intercept
              Intercept         and
Criterion     Only              Covariates
AIC           51.712            31.313
SC            53.894            36.768
-2 Log L      47.712            21.313

Testing Global Null Hypothesis: BETA=0

Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    26.3987       3     <.0001
Score               14.2666       3     0.0026
Wald                 6.6446       3     0.0841

Analysis of Maximum Likelihood Estimates

                                   Standard    Wald
Parameter      DF    Estimate      Error       Chi-Square    Pr > ChiSq
Intercept 0    1     44.4240       20.6158     4.6434        0.0312
Intercept 1    1     48.0236       21.3889     5.0412        0.0248
A              1     -0.5914        0.2991     3.9106        0.0480
N              1     -0.0307        0.0637     0.2328        0.6295
S              1     -0.00022       0.000086   6.3513        0.0117
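
As a reading aid for the Appendix 3 listing (and not part of the examination paper itself), the following minimal Python sketch checks how the reported fit statistics relate to one another. It relies on the standard definitions AIC = -2 Log L + 2k and SC = -2 Log L + k ln(n), and takes the likelihood ratio chi-square as the difference between the two -2 Log L values. The parameter counts (2 intercepts; 2 intercepts plus A, N and S) are read off the estimates table. The helper chi2_sf_df3 is a hypothetical name for the closed-form upper-tail probability of a chi-square variate with 3 degrees of freedom.

```python
import math

# Values copied from "Model Fit Statistics" and the score test in Appendix 3.
NEG2LOGL_NULL = 47.712      # intercepts only (2 parameters)
NEG2LOGL_FULL = 21.313      # intercepts plus A, N and S (5 parameters)
N_OBS = 22


def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability of a chi-square variate with 3 df (closed form)."""
    tail_normal = math.erfc(math.sqrt(x / 2.0))
    tail_extra = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    return tail_normal + tail_extra


lr_stat = NEG2LOGL_NULL - NEG2LOGL_FULL          # likelihood ratio chi-square
aic_full = NEG2LOGL_FULL + 2 * 5                 # AIC = -2 Log L + 2k
sc_full = NEG2LOGL_FULL + 5 * math.log(N_OBS)    # SC = -2 Log L + k ln(n)
p_lr = chi2_sf_df3(lr_stat)                      # global null hypothesis test
p_score_po = chi2_sf_df3(1.7989)                 # proportional odds score test

print(round(lr_stat, 3), round(aic_full, 3), round(sc_full, 3))
print(p_lr < 0.0001, round(p_score_po, 4))
```

The closed form is specific to 3 degrees of freedom; for other degrees of freedom a statistical library or chi-square table would be needed.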