Statistics exam

Assignment in Statistics and Research Methods
Winter 2014/2015

Jeppe Mulvad Jensen
Katrine Bang-Achton
Stefan Jørgensen

BSc in International Business and Politics
Copenhagen Business School

Characters with spaces: 34.046
Word count: 5.552
Pages: 21

Declaration of authorship

Question 1

This question describes each of the variables in the data set and notes remarkable observations about them. The focus of the data set is a semi-quantitative vocabulary test (wordsum) and a binary variable concerning legalization of marijuana (grass). Grass and wordsum are response variables; age, sex, education, real income, zodiac and class are thus explanatory variables.

We differentiate between categorical and quantitative variables, where categorical variables are observations that belong to a certain set of categories and quantitative variables are numerical values. Our categorical variables are grass, sex, zodiac and class, while our quantitative variables are wordsum, age, education and real income.

Furthermore, quantitative variables can be described as being either continuous or discrete. Discrete variables can be thought of as "counted numbers" or a set of separate numbers (e.g. 1, 2, 3, 4), while continuous variables have a continuum of infinitely many possible values. The discrete variables in our data set are wordsum, age and education. However, it should be noted that age and education are in theory continuous variables (e.g. you could be 20.25 years old and have studied for 16.54 years), but in the given data set these variables are recorded as discrete. Real income is a continuous variable.

To investigate whether there is anything remarkable to note, we look at the distributions of our variables.

Variable | Distribution and remarkable observations
Grass | The variable is relatively equally distributed between legal/illegal: 52,5% are against legalization and the remaining 47,5% are pro-legalization. However, the fact that 740 respondents did not state whether they are for or against might create a non-response bias.
Wordsum | Slightly skewed to the left with a mean of 5,9. The interquartile range (IQR) is 2 (from 5 to 7) and thus 50% of our observations are between 5 and 7. Out of the 1974 respondents, the wordsum of 696 respondents is not given. Once again, this might create a non-response bias.
Age | Age has a mean of 48,2 and is slightly skewed to the right, but with no outliers. The range is 71 (between 18 and 89).
Sex | 55% are female, 45% are male.
Education | Ranges between 0 and 20 and has a mean of 13,53. Skewed to the left with an IQR of 4 (between 12 and 16). An education of 12 years accounts for 27,38% of the sample.
Real income | Real income is highly skewed to the right, which makes sense since there is a lower limit (0), but in theory no upper limit. Real income ranges between 245 and 155140 with a mean of 34295. It is remarkable that there are 144 observations with the exact income of 155139,973, which are all marked as outliers. With the given sample size this is highly unlikely and most likely means that all incomes above 155139,973 are recorded as this exact income.
Zodiac | Evenly distributed.
Class | Lower class accounts for 10,2%, working class for 43,6%, middle class for 42,9% and upper class for 3,3%. Thus, working class and middle class together take up 86,5% of the sample.

Question 2

This question concerns the probability of being pro-legalization, i.e. we are looking at our response variable "grass". The question is fourfold: First, we state a confidence interval for being pro-legalization of grass. Secondly, we perform a statistical test of the hypothesis that there is a 50% probability of being pro-legalization. Thirdly, we compare the probabilities of being pro-legalization between the sexes. Finally, we conduct a chi-square test to analyse whether zodiac plays a role.
Confidence interval for pro-legalization of grass

A confidence interval reflects the uncertainty about a parameter's true value. Based on the sampling distribution, the interval is constructed so that, at a given confidence level, it captures the population proportion. Assuming an approximately normal sampling distribution, we use a 95% confidence level. As we have at least 15 successes ($n\hat{p} \geq 15$) and at least 15 failures ($n(1 - \hat{p}) \geq 15$), we assume a normal distribution.

Although we have a total sample size of n = 1974, only 1234 respondents answered this question. We treat these 1234 respondents as our sample. With 586 respondents answering yes, the sample proportion answering yes is $\hat{p} = 586/1234 = 0,47488$.

The standard deviation of the sampling distribution of the proportion is $\sqrt{p(1 - p)/n}$, whereas its estimate from the sample is the standard error (se), calculated as $se = \sqrt{\hat{p}(1 - \hat{p})/n}$. Thus, our confidence interval for the population proportion in favor of legalizing grass is $\hat{p} \pm 1,96 \cdot se$.

Our standard error becomes $se = \sqrt{0,47488(1 - 0,47488)/1234} = 0,0142156$, and thus our confidence interval becomes $0,47488 \pm 1,96 \cdot 0,0142156 = (0,447 ; 0,50274)$. This means that we can say with 95% confidence that the probability of being pro-legalization is between 44,7% and 50,274%. JMP produces a confidence interval that differs from ours only marginally.

Statistical test of the hypothesis that there is 50% probability of being pro-legalization

Assumptions

First, the variable has to be categorical (in our case either for or against legalization). Second, the data is obtained by randomization; this is an assumption we make throughout the assignment. Third, the sampling distribution is approximately normal, with the "successes" (pro-legalization) and "failures" (against legalization) each numbering at least 15.
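The interval and the large-sample condition above can be reproduced with a few lines of Python. This is only a sketch of the large-sample (Wald) formula used in the text, not a reproduction of JMP's procedure; the function name `proportion_ci` is our own.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Large-sample (Wald) confidence interval for a population proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

# Counts taken from the text: 586 "yes" answers out of 1234 respondents.
successes, n = 586, 1234
failures = n - successes

# Normal-approximation condition: at least 15 successes and 15 failures.
large_sample_ok = successes >= 15 and failures >= 15

p_hat, se, (lower, upper) = proportion_ci(successes, n)
print(f"condition met: {large_sample_ok}")
print(f"p_hat = {p_hat:.5f}, se = {se:.7f}, 95% CI = ({lower:.5f}, {upper:.5f})")
```

Changing `z` (for example to 2.576) gives the interval at another confidence level under the same assumptions.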
Hypotheses

The null hypothesis is $H_0: p = 0,5$. The alternative hypothesis is two-sided, $H_a: p \neq 0,5$; thus we test for the two-tail probability, as we are interested in whether the population proportion falls above or below a certain point, i.e. 0,5.

Test statistic

The test statistic measures how far the sample proportion, $\hat{p}$, falls from the null hypothesized proportion $p_0 = 0,5$, presuming the null hypothesis to be true. The z-score gives the number of standard errors between the sample proportion and the null hypothesized value:

$z = \frac{\hat{p} - p_0}{se} = \frac{0,47488 - 0,5}{0,014215555} = -1,76708$

A z-score of -1,76708 tells us that the sample proportion $\hat{p}$ is 1,76708 standard errors below the null hypothesis value.

P-value

The P-value states the probability that the test statistic takes a value like the observed test statistic or even more extreme, given that the null hypothesis is true. From Table A (footnote 1) we see that the corresponding one-tail probability for the z-test is 0,0392. As it is a two-sided test, this has to be multiplied by 2: 2*(0,0392) = 0,0784.

JMP has applied a chi-square test to this question. JMP shows a P-value from the chi-square test of 0,0775, which is sensible as $z^2 = (-1,767)^2 \approx 3,12$ should be equal to the chi-square statistic. The small difference between the two P-values is consistent with the standard error under the null hypothesis, $\sqrt{0,5 \cdot 0,5/1234} \approx 0,01423$, being used instead of the estimated one, which gives z ≈ -1,765. Note that 1 - 0,0775 = 0,9225 should not be read as a 92,25% probability that the proportion differs from 50%; the P-value only measures how likely data at least this extreme would be if the null hypothesis were true.

1 Retrieved from Learn, 27-12-2014.

Conclusion

With a P-value of 0,0784, which is higher than our significance level of 0,05, we cannot reject our null hypothesis. Consequently, we cannot reject that the population proportion of being pro-legalization is equal to the hypothesized value of 50%.

Probabilities of being pro-legalization between males and females

The contingency table shows that 42,79% of the female respondents and 53,25% of the male respondents have answered yes.
A confidence interval for the difference between the two population proportions can be found by $(\hat{p}_1 - \hat{p}_2) \pm z \cdot se$. We choose a 95% confidence level. As we have a large random sample, the central limit theorem tells us that the sampling distribution will be approximately normal; thus the confidence interval has a margin of error equal to 1,96 standard errors. To find the confidence interval we need the standard error, which is calculated as:

$se = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = \sqrt{\frac{0,4279(1 - 0,4279)}{680} + \frac{0,5325(1 - 0,5325)}{554}} = 0,028449253$

We can now calculate the confidence interval: $(0,4279 - 0,5325) \pm 1,96 \cdot 0,028449253 = (-0,16036 ; -0,0488)$. We conclude with 95% confidence that the difference between female and male respondents in the proportion answering that grass should be legalized is between -0,16036 and -0,0488. Thus, we can say with 95% confidence that the proportion of women being pro-legalization is between 0,0488 and 0,16036 lower than that of men.

The odds ratio compares the odds of an event occurring in one group with the odds of the event occurring in another group; in this case we compare the odds of being pro-legalization for females with those for males. The odds for women being pro-legalization are $\frac{291/680}{389/680} = 0,74807$ and the odds for males being pro-legalization are $\frac{295/554}{259/554} = 1,138996$. Consequently, the odds ratio is $\frac{0,74807}{1,138996} = 0,65678$.

The odds ratio in JMP bases the calculation on the non-event; thus the ratio is based on males and females responding negatively to legalization of grass. We construct a 95% CI to estimate the precision of the odds ratio. JMP gives CI = (0.524096; 0.823059). If the confidence interval contains 1, it implies that we cannot conclude that there is a difference between females and males.
However, here it does not, so the odds of being against legalization for males are about 0,66 times those for females (roughly 34% lower), with the true population odds ratio lying, with 95% confidence, between 0,524 and 0,823. Consequently, we can say with great certainty that males and females have different attitudes to legalization of grass.

Zodiac's role on grass

To check whether zodiac plays a role means testing for independence between the two variables zodiac and grass. We test for independence by computing a chi-squared test statistic from a contingency table (appendix 1).

Assumptions

The chi-squared test of independence is based on the following assumptions:
1. Two categorical variables. These are in our case zodiac and grass.
2. Randomization.
3. Expected count ≥ 5 in all cells.

The expected cell count is calculated by the following formula:

$\text{expected count} = \frac{(\text{row total}) \times (\text{column total})}{\text{total sample size}}$

Thus, the expected cell count in the first cell is $\frac{92 \times 575}{1202} = 44,01 \approx 44$. Using this formula in all cells of the contingency table, we see that the third assumption is also satisfied. All three assumptions are therefore satisfied and we can conduct a chi-squared test. We choose a significance level of 0,05.

Hypotheses

Our null hypothesis ($H_0$) is that zodiac and grass are independent, i.e. that zodiac does not play a role in respondents' attitudes towards grass. Our alternative hypothesis ($H_a$) is that the two categorical variables are in fact dependent, i.e. zodiac does play a role in respondents' attitudes towards grass.

Test statistic

To be able to reject or fail to reject the null hypothesis, we need to find the chi-squared value. The formula for the chi-squared value is:

$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$

As we saw above, the expected count in the first cell is approximately 44; using the unrounded value, the chi-square contribution of the first cell is $\frac{(43 - 44,01)^2}{44,01} = 0,0232$. The chi-squared value for each cell is also written in the contingency table (appendix 1).
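The two formulas above (expected count and per-cell chi-squared contribution) can be sketched in Python as follows. The only numbers used are the ones quoted in the text for the first cell; the helper names are our own, and `chi_squared_statistic` is a generic sketch that would have to be fed the full 12×2 table of observed counts from appendix 1, which is not reproduced here.

```python
def expected_count(row_total, col_total, n):
    """Expected cell count under independence."""
    return row_total * col_total / n

def cell_contribution(observed, expected):
    """One cell's contribution to the chi-squared statistic."""
    return (observed - expected) ** 2 / expected

def chi_squared_statistic(observed):
    """Chi-squared statistic for a table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum(
        cell_contribution(obs, expected_count(row_totals[i], col_totals[j], n))
        for i, row in enumerate(observed)
        for j, obs in enumerate(row)
    )

# First cell of the zodiac/grass table, using the totals quoted in the text.
first_expected = expected_count(92, 575, 1202)
first_contribution = cell_contribution(43, first_expected)
print(f"expected = {first_expected:.2f}, contribution = {first_contribution:.4f}")
```

Using the unrounded expected count matters here: rounding it to 44 before squaring would give a slightly smaller contribution than the 0,0232 reported in the contingency table.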
Adding together the chi-squared values of all 24 cells, we get a final chi-squared value of 7,3938. The larger the chi-squared value, the greater the evidence against $H_0$, i.e. against independence. However, it can be hard to judge whether a number is large or not, which is why we need a P-value in order to reject or fail to reject our null hypothesis.

P-value

To find the P-value for the chi-squared statistic, we use the sampling distribution of the chi-squared statistic. Given the degrees of freedom, the P-value is the right-tail probability above the observed chi-squared value, i.e. the probability of the chi-squared statistic for a random sample being larger than the observed value when $H_0$ is true. The degrees of freedom are calculated as df = (rows - 1) × (columns - 1). Thus, df = (12 - 1) × (2 - 1) = 11.

Given df = 11 and $X^2 = 7,3938$ we can find our P-value by looking at Table C (footnote 2). For df = 11 the lowest chi-squared value shown is 13,7 with an associated P-value of 0,250. Since $X^2 = 7,3938$ falls below this, it has a larger right-tail probability. Thus, the P-value is > 0,25.

Conclusion

With a P-value larger than 0,25 and a significance level of 0,05, we cannot reject $H_0$. Based on this sample, we therefore have no evidence of an association between zodiac and grass.

Question 3

This question concerns a comparison of wordsum between male and female respondents. First, we give a 95% CI for the difference between the means and then test the difference between the sexes with a statistical test. Second, we compare wordsum between the different classes.

Comparison of wordsum between female and male respondents

By constructing two histograms for female and male respondents and their results in the wordsum test (appendix 2), we see that their distributions are approximately symmetric with means differing only slightly: the mean is 5,9395 for females and 5,8988 for males.
The standard deviation is 2,056 for females and 2,046 for males (given by JMP). A potential outlier can be defined as an observation falling more than 3 standard deviations from the mean in a bell-shaped distribution. Thus, observations must fall between 5,9395 ± (3*2,056) = (-0,23 ; 12,11) for females and 5,8988 ± (3*2,046) = (-0,24 ; 12,04) for males in order not to be outliers. Since our observations can only fall between 0 and 10, we have no outliers.

When calculating a confidence interval for the difference between two means, we first need to calculate the standard error. The standard error expresses how well the difference between the sample means $(\bar{x}_1 - \bar{x}_2)$ estimates the difference between the population means. The equation is:

$se = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

As can be seen in the output from JMP, the standard error equals 0,11517. For a 95% confidence interval the margin of error is approximately two standard errors, so the interval for the difference between the population means is $(\bar{x}_1 - \bar{x}_2) \pm 2(se)$. With the respective means of 5,9395 (women) and 5,8988 (men), and taking the difference as men minus women, the confidence interval becomes (-0,26672 ; 0,18518); its half-width corresponds to a multiplier of about 1,96 rather than exactly 2. As the confidence interval for the difference between male and female wordsum contains the value 0, we cannot claim that there is a difference between male and female respondents. When 0 falls in the interval it is possible that $\mu_1 = \mu_2$. In the following we conduct a t-test to investigate whether this is plausible.

2 Retrieved from Learn, 27-12-2014.

Assumptions

First, a quantitative response variable for two groups. This is in our case wordsum. Second, independent random samples. Third, an approximately normal population distribution for each group.

Hypothesis

$H_0: \mu_1 = \mu_2$, i.e. the two means are the same.
$H_a: \mu_1 \neq \mu_2$, i.e. the two means are not the same.

Test statistic

We conduct a significance test comparing the two means.
The test gives us a t-statistic with an associated p-value, with which we will be able to reject or not reject the null hypothesis. We use the following equation:

$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{se}$

Given our means and our standard error, t equals $\frac{(5,8988 - 5,9396) - 0}{0,11517} = -0,354$.

P-value

Given the t-value of -0,354 and df equal to 1240 (given by JMP), we get a two-sided p-value of 0,7234 by using software.

Conclusion

When the p-value is higher than the significance level (in our case 0,05), we cannot reject our null hypothesis; thus we cannot reject that the two means are the same.

Comparing wordsum between the different classes

As class is a categorical variable and wordsum is a quantitative variable, we can compare the two with an analysis of variance (ANOVA) in order to compare the means of each group. With an ANOVA we can test for differences between the means in each class, i.e. to see whether some classes score better on the vocabulary test than others. With ANOVA we test all the means simultaneously, taking into account the variance both within and between the classes.

As we are only comparing wordsum across class, this is a one-way ANOVA with 4 levels (coinciding with the 4 categories in the factor), which is why we are testing 4 means (one from each class).

Assumptions:

First, normal distribution. Second, equal standard deviations in the populations. Finally, randomization.

Hypotheses:

To test for variance between the means, we have a null hypothesis stating that the means are all equal, thus the differences between the group means and the overall mean are all zero.

$H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$
$H_1$: At least one group effect $\alpha_i$ differs from zero, i.e. the group means are not all equal.

Test Statistic:

By making an F test, we assume an F distribution, where all the values are positive and the distribution is positively skewed.
The total variability is the sum of squared deviations from the grand mean and can be written as:

$\sum_{i,j}(x_{ij} - \bar{\bar{x}})^2 = \sum_i n_i(\bar{x}_i - \bar{\bar{x}})^2 + \sum_{i,j}(x_{ij} - \bar{x}_i)^2$, i.e. $SS_{tot} = SS_{treat} + SS_{error}$,

where $\bar{\bar{x}}$ = grand mean = 5,9337, $\bar{x}_i$ = group mean, $(\bar{x}_i - \bar{\bar{x}})$ = variability between groups and lastly $(x_{ij} - \bar{x}_i)$ = variability within groups.

To find the F ratio we divide the mean square deviation between groups by the one within groups:

Between-group: $MSE_{treat} = \frac{SS_{treat}}{r - 1}$ and within-group: $MSE_{error} = \frac{SS_{error}}{n - r}$

$F = \frac{MSE_{treat}}{MSE_{error}} \sim F_{r-1,\, n-r}$

$SS_{treat} = n_A(\bar{x}_A - \bar{\bar{x}})^2 + \dots + n_D(\bar{x}_D - \bar{\bar{x}})^2 = 135 \cdot (5,22963 - 5,933702)^2 + \dots + 40 \cdot (6,924 - 5,933702)^2 = 268,3687$

$SS_{error} = 4985,0622$

$F = \frac{SS_{treat}/(r - 1)}{SS_{error}/(n - r)} = \frac{268,3687/(4 - 1)}{4985,0622/(1267 - 4)} = 22,664353$

We expect JMP to get results differing slightly from ours, as software mostly uses more sophisticated formulas to achieve a high degree of accuracy. The F ratio in JMP is equal to 22,4110 (see table).

P-value:

An F ratio of 22,4110 lies far out in the right tail of the distribution; in addition, a P-value of 0,0001 (given by JMP) represents significance in our test, and we can reject our null hypothesis.

Conclusion:

As we can reject our null hypothesis, there is a difference between at least two group means. Consequently, there must be a difference in wordsum performance between classes. The clear difference we see between classes in the wordsum test might not be best explained by class itself, but by education. Using class as an explanatory variable might very well capture the effect of education as well. We can see a clear association between the wordsum test and class, but as usual we cannot infer causality and claim that class is the reason for the difference.

Question 4

In this question we perform a linear regression where wordsum is described by age.
Furthermore, we compute a 95% confidence interval for the slope of the regression and consider whether a quadratic model fits the data better.

Linear regression (wordsum described by age) and confidence interval

Performing a linear regression in JMP, the regression equation is y = b + ax, where wordsum is the response variable (y) and age is the explanatory variable (x). In this case: wordsum = intercept + slope*age = 5,43478 + 0,01025*age (see graphical illustration in Appendix 3).

At first we notice that the P-value, namely 0,0016, is far below the significance level of 5%, meaning that the regression model is statistically significant. When we move on to the parameter estimates, it becomes clear that both the intercept and the slope are also statistically significant. The intercept should be interpreted as the expected wordsum for a person aged 0. Since the minimum age in the survey is 18, this is not a result with which we can make any inference about the population. The slope of 0,01 means that for each year you get older you are expected to answer 0,01 more questions correctly. Age is related to the response of the wordsum test, but it constitutes only a small effect, as you would have to age 100 years to get one more question correct. As the result of the wordsum test is a discrete variable taking a value from 0 to 10, age seems to carry an effect that is only visible on average and does not make for any differences at the individual level.

Another important part of the analysis of this regression is the very low Rsquare, meaning that age does not count as a major determinant. With an Rsquare of 0,007, age explains approximately 0,7% of the variability in wordsum.

Moving on to the CI, we see that a 95% CI for the slope of the regression is between 0,0038958 and 0,0166035. This reflects the small effect of age on the number of correct words in the wordsum test.
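To make the size of the age effect concrete, the fitted line can be evaluated at the youngest and oldest ages in the sample (18 and 89, from Question 1). The sketch below uses only the coefficients and the slope interval reported above; the function name is our own, and scaling the slope interval by the age gap is a simple illustration of the expected difference in mean wordsum, not output from JMP.

```python
# Coefficients of the fitted line and the 95% CI for the slope, as reported in the text.
INTERCEPT = 5.43478
SLOPE = 0.01025
SLOPE_CI = (0.0038958, 0.0166035)

def predict_wordsum(age):
    """Expected wordsum at a given age under the simple linear model."""
    return INTERCEPT + SLOPE * age

youngest, oldest = 18, 89  # age range of the sample (Question 1)

# Expected difference in wordsum between the oldest and youngest respondents.
gap = predict_wordsum(oldest) - predict_wordsum(youngest)

# The same difference using the lower and upper ends of the slope interval.
gap_ci = tuple(bound * (oldest - youngest) for bound in SLOPE_CI)

print(f"predicted wordsum at 18: {predict_wordsum(youngest):.2f}")
print(f"predicted wordsum at 89: {predict_wordsum(oldest):.2f}")
print(f"difference over 71 years: {gap:.2f} (from {gap_ci[0]:.2f} to {gap_ci[1]:.2f})")
```

Even across the full age range of the sample, the expected difference stays below one correct answer at the point estimate, which is in line with the small effect described above.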
To conclude, we cannot describe age as having a powerful effect on wordsum even though it is statistically significant.

Quadratic model

The quadratic model can be seen in appendix 4. Fitting a quadratic model for wordsum described by age seems to generate a better fit with the data, i.e. produce smaller residuals. The equation for the quadratic model is:

wordsum = 5,4840382 + 0,0122259*age - 0,0004581*(age - 48,1935)^2

Comparing the quadratic model with the linear regression, we see that the Rsquare is larger for the polynomial than for the linear regression model. The Rsquare of the polynomial is 0,013 and thus not far from twice as large as for the linear model. Taking the same approach as for the linear regression, we see that the quadratic model is statistically significant, as are all the parameters at our chosen significance level. Comparing the individual P-values of the explanatory variables, it becomes apparent that they are very small and not far from each other in both models. Therefore we do not draw any conclusions about which model is better at describing the data on the basis of the P-values.

The main difference between the two models is that the linear model assumes a constant marginal effect of the explanatory variable, i.e. age, whereas the quadratic model allows for different effects at different levels of the explanatory variable. In this particular case the marginal effect of age is 0,0122259 - 2*0,0004581*(age - 48,1935), which is zero at an age of about 61,5; the model thus tells us that predicted wordsum rises until roughly that age and seems to deteriorate afterwards. The value 48,1935 in the equation is only the point around which the squared term is centered.

In conclusion, the quadratic model does seem to produce a better fit, as the Rsquare is greater and the coefficients show significance.

Question 5

In this question we conduct a multiple linear regression analysis by adding sex, education, class and real income to the regression model to investigate to what extent they explain wordsum.
a) Additive model

Y = 1,4145 + 0,01279*age + 0,02672*sex(female) + 0,28092*education + (certain)class + 4,0441*realinc.

Analysing the output for the model in JMP, we note that according to the F-test we can conclude that at least one of the variables is statistically significant. We note that the variable "sex" and all those related to class are statistically insignificant. The coefficient "intercept" seems significant, but its interpretation makes it less meaningful. As we have used categorical explanatory variables, the intercept is interpreted as the expected number of correct answers in the wordsum test for an upper-class male aged 0 with 0 years of education and 0 real income. As this is not a type of person observed in the data, nor does the existence of such a person seem probable, we cannot use this coefficient to draw any inferences.

The Rsquare of the model is just above 23%, which means that the model is able to explain approximately 23% of the variability in the wordsum test.

b) Assumptions, nonlinear effects/interactions and modification of the model

In JMP the method of ordinary least squares (OLS) is used in order to make the squared residuals as small as possible, thus finding the best-fitting regression. An assumption in using OLS is that the residuals are approximately normally distributed. To make sure that the assumptions of the linear regression are not violated, we plotted the residuals and calculated the skewness factor. In this case the residuals have a skewness factor of -0,252.

Figure 1: Residuals of wordsum, JMP

The skewness factor is sufficiently close to 0 to claim that the residuals are approximately normally distributed. We further take the following to hold: 1) the error term has a population mean of 0, 2) errors are uncorrelated and 3) the error term has constant variance. We also check that all explanatory variables are uncorrelated with the error term and that no explanatory variable is a perfectly linear function of another (appendix 5).
As all these assumptions are fulfilled, we confirm the results of the model.

Possible nonlinear effects

When real income is logarithmically transformed, the Rsquare of its regression on wordsum moves from 0,066 to 0,111, which provides a better fit. Graphing the means of wordsum against real income also shows something similar to a logarithmic function. None of the other variables shows any tendency towards strong non-linear effects. This has been checked by graphing the individual explanatory variables in relation to the wordsum test (appendix 5).

Intuitively, not many interactions seem likely to be present. When searching for interactions it might at first seem sensible to check for an interaction between real income and class, as these two variables seem to describe nearly the same thing. However, this is misleading: as both variables seem to be a measure of income, we would not expect a change in real income to affect the impact that class has on the wordsum test. Of all the possible interactions we have tested for, none seems to be even slightly significant. The coefficients of the interaction terms have all been very close to 0, and the P-values corresponding to the t-tests have all been well above our significance level of 0,05. Possible interactions are found by multiplying two variables and treating the result as a variable in its own right, while still including the original two variables in the regression model.

As none of the interaction terms we tested for proved to be statistically significant, our model now takes into account age, log of real income, class and education. We have removed the variable sex, as it failed to provide any predictive value for the wordsum test. Even though we earlier established that age has a higher predictive value when fitted as a quadratic polynomial, we include it linearly in our final model. A squared age term could in principle be added as an additional predictor, but we have not done so here.
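The two modifications described in this subsection, a logarithmic transformation of real income and an interaction term formed as the product of two variables, can be sketched as a small preprocessing step. Everything in the block is an assumption made for illustration: the column names (`age`, `educ`, `realinc`), the choice of age × education as the example interaction, and the two sample rows, which are not taken from the data set. The regression itself would still be fitted in JMP or similar software.

```python
import math

def add_derived_terms(rows):
    """Return copies of the rows with a log-income column and one interaction column.

    Each row is a dict with the (hypothetical) keys 'age', 'educ' and 'realinc'.
    The original columns are kept, as an interaction term is added alongside
    the two variables it is built from.
    """
    extended = []
    for row in rows:
        new_row = dict(row)
        new_row["log_realinc"] = math.log(row["realinc"])  # log-transformed income
        new_row["age_x_educ"] = row["age"] * row["educ"]   # product of two predictors
        extended.append(new_row)
    return extended

# Purely illustrative rows, not observations from the data set.
sample_rows = [
    {"age": 30, "educ": 12, "realinc": 20000.0},
    {"age": 55, "educ": 16, "realinc": 60000.0},
]

for row in add_derived_terms(sample_rows):
    print(row)
```

An interaction term built this way is only meaningful in the regression if its two parent variables remain in the model, which is why the function keeps the original columns.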
c) Analysing the statistical significance of the predictors

The significance of the predictors: age, education and real income all perform very well in terms of statistical significance. Class, on the other hand, now performs worse than in the previous model, and all the different class categories are now labeled as insignificant. Thus, we cannot reject the hypothesis that their coefficients are 0.

In a simplification of the model we eliminate insignificant predictors, such as class and sex (the latter was removed just prior to this subsection). Consequently, the best model for predicting wordsum with the information given includes the predictors age, education and log of real income.

From JMP we now have an F ratio equal to 124,5 and an Rsquare equal to 0,2456. This reflects a relatively good model for predicting wordsum. The model before being simplified actually had an Rsquare 0,0009 higher than the present one. However, the simplified model has a much higher F ratio, and all predictors show significance and have positive CIs (a positive relation between each predictor and wordsum). Given our analysis, the current model is therefore the best. The decrease in Rsquare is very small and thus a small price for increasing the significance of the explanatory variables (remember that we removed class) and increasing the overall F-test. Conversely, one could say that the slightly higher Rsquare in the first model is due to the inclusion of class, but as this is an insignificant variable we do not wish to use it for artificially increasing our Rsquare.

A 95% CI for the effect of education is given in JMP: (0.2278 ; 0.3036). Thus, we can say with 95% confidence that the predicted number of correct answers in the wordsum test increases by 0,2278 to 0,3036 per extra year of education, holding the other variables constant.

Question 6

a) Logistic regression model predicting grass from education

Logistic regression is used when our response variable is categorical.
As we only have two categories in the response variable, it can be defined as a dummy variable with the binary outcome coded as 0 (against legalization) or 1 (for legalization). The population mean is equal to the population proportion of people being for legalization, $\mu_y = p$; hence it must be between 0 and 1 for all values of x, which holds because the numerator is always smaller than the denominator in the following logistic regression equation:

$p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$

Consequently, the shape of the regression becomes more realistic as an S-shape rather than a linear trend. β does not represent the slope, but determines the steepness of the curve: as β increases, the curve becomes steeper.

Intercept = α = -1.1410444 and β = 0.07687169

The minimum value of our explanatory variable, years of education, is 0, thus $\hat{p} = 0,24213$ (probability of being pro-legalization with 0 years of education). The maximum value of years of education is 20, such that $\hat{p} = 0,59782$. There is an upward-sloping trend in education and being pro-legalization. When a person has 14.8 years of education, being pro and being against legalization are equally likely (p = 0.5 at x = -α/β = 14.8434931).

Probability (p) | 0 yrs of education | 1 yr of education
Pro | 0.24213 | 0.256513
Against | 0.75787 | 0.743487

Odds p/(1-p) | 0 yrs of education | 1 yr of education
Pro | 0.3194875 | 0.345013162
Against | 3.1300128 | 2.89843789

The odds ratio represents the odds that an outcome (pro-legalization) will occur given education, compared to the odds of the outcome occurring in the absence of education (or with a year less of education). Odds ratio for being pro-legalization given 1 year of education = 0.345013162/0.3194875 = 1.079896. Odds ratio against legalization given one year of education = 2.89843789/3.1300128 = 0.9260.
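The probabilities, odds and odds ratio above all follow from the two reported parameter estimates, which can be checked with a short Python sketch. Only α and β are taken from the text; the function names are our own, and the block is a plain evaluation of the logistic equation rather than a refit of the model.

```python
import math

# Parameter estimates reported in the text (from JMP).
ALPHA = -1.1410444
BETA = 0.07687169

def prob_pro(educ_years):
    """Estimated probability of being pro-legalization at a given education level."""
    linear = ALPHA + BETA * educ_years
    return math.exp(linear) / (1 + math.exp(linear))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

p0, p1, p20 = prob_pro(0), prob_pro(1), prob_pro(20)

# Odds ratio for one extra year of education; algebraically this equals exp(BETA).
odds_ratio = odds(p1) / odds(p0)

# Education level at which pro and against are equally likely (p = 0.5).
educ_at_half = -ALPHA / BETA

print(f"p(0 yrs) = {p0:.5f}, p(1 yr) = {p1:.5f}, p(20 yrs) = {p20:.5f}")
print(f"odds ratio per year = {odds_ratio:.6f}, exp(beta) = {math.exp(BETA):.6f}")
print(f"p = 0.5 at {educ_at_half:.4f} years of education")
```

Because the odds ratio per year equals exp(β) regardless of the starting education level, comparing 0 and 1 years is only one of many equivalent ways to obtain it.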
The unit odds ratio equals 1.079904 (from JMP), which means that for each additional year of education the odds of a person being pro-legalization of grass are multiplied by 1.0799. As the unit odds ratio is relatively small (close to 1), this is consistent with education not being very explanatory of the observed persons' attitudes towards grass. JMP gives a 95% CI for the odds ratio of (1.041519; 1.120413). This is a relatively narrow CI, which indicates high precision in our odds ratio; moreover, as the interval does not contain 1, we can conclude with 95% confidence that education has a positive effect on being pro-legalization of grass.

To evaluate the predictive power of the whole model we investigate the sensitivity and specificity measures, as R² is not applicable to logistic regression. Sensitivity is the ability to predict an event correctly; specificity is the ability to predict a non-event correctly. Sensitivity is calculated by dividing the correctly predicted events by the observed events: 225/(225+361) = 0.384 or 38%. Specificity is calculated by dividing the correctly predicted non-events by the observed non-events: 435/(213+435) = 0.671 or 67%. Furthermore, 213/(225+213) = 48.6% of predicted events are false positives and 361/(361+435) = 45.4% of predicted non-events are false negatives. Consequently, the low sensitivity and moderate specificity imply that we should be cautious in using the predictions of the logistic regression model to draw conclusions about the population.

b) Multiple logistic regression

We now extend the model to a multiple logistic regression, predicting grass from education together with the variables age, class, sex and wordsum. The parameter estimates can be seen in the JMP output. The multiple logistic regression prediction equation expands to:

$$\hat{p} = \frac{e^{-0.54 + 0.0073\,\mathrm{educ} - 0.019\,\mathrm{age} + \beta_{\mathrm{class}} - 0.288\,\mathrm{sex(Female)} + 0.2442\,\mathrm{wordsum}}}{1 + e^{-0.54 + 0.0073\,\mathrm{educ} - 0.019\,\mathrm{age} + \beta_{\mathrm{class}} - 0.288\,\mathrm{sex(Female)} + 0.2442\,\mathrm{wordsum}}}$$

where β_class denotes the coefficient of the person's particular class category, as given in the JMP output.

The positive parameter estimates, such as those for education, lower class and wordsum, imply a positive relationship with attitude towards grass, i.e. a higher estimated probability of being pro-legalization of grass with more education or the like. Conversely, the negative coefficients imply a negative effect on the attitude towards grass, i.e. a lower estimated probability of being pro-legalization. The P-values for age, middle class, sex and wordsum are all below our significance level, which is strong evidence that the corresponding population effects on grass are non-zero. However, education and the class categories other than middle class (such as lower class) have P-values above our significance level, indicating that they are not explanatory when predicting grass.

(c) Statistical tests of significance for each predictor

JMP reports Likelihood Ratio tests with chi-square statistics to check the significance of the different coefficients. In both the parameter estimates and the Likelihood Ratio tests, age, sex and wordsum report P-values less than 0.05, showing significance for all three parameters. Consequently, we can reject the null hypothesis for these variables and conclude that the attitude to legalization of grass does depend on them. For class, however, only middle class shows significance in the parameter estimates, with a P-value less than 0.05; here the null hypothesis can be rejected, as there is a difference in attitude between upper and middle class. Because one of the 4 categories in class has shown significance, the effect Likelihood Ratio test also reports overall significance for class; we should therefore not refit the multiple logistic regression without this variable.
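Returning briefly to the classification figures quoted for the simple model in part (a): the sketch below recomputes sensitivity, specificity and the shares of false predictions from the four counts given in the text. The counts are taken from the text; the variable names are our own and purely illustrative.

```python
# Classification counts quoted in the text for the simple logistic model.
TRUE_POS = 225   # observed pro-legalization, predicted pro-legalization
FALSE_NEG = 361  # observed pro-legalization, predicted against
FALSE_POS = 213  # observed against, predicted pro-legalization
TRUE_NEG = 435   # observed against, predicted against

# Share of observed events (pro-legalization) that the model predicts correctly.
sensitivity = TRUE_POS / (TRUE_POS + FALSE_NEG)

# Share of observed non-events (against) that the model predicts correctly.
specificity = TRUE_NEG / (TRUE_NEG + FALSE_POS)

# Share of predicted events that are wrong, and share of predicted non-events that are wrong.
false_pos_share = FALSE_POS / (TRUE_POS + FALSE_POS)
false_neg_share = FALSE_NEG / (FALSE_NEG + TRUE_NEG)

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"false positives among predicted events     = {false_pos_share:.3f}")
print(f"false negatives among predicted non-events = {false_neg_share:.3f}")
```

Note that sensitivity and specificity are computed over the observed outcomes, whereas the two false-prediction shares are computed over the predicted outcomes, which is why the denominators differ.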
The effect Likelihood Ratio test for education shows a very high P-value, much greater than our significance level of 0.05. Consequently, this parameter should be removed in a modified model, as it has no predictive power for the population's attitude towards grass. Without education, the explanatory variables age, class, sex and wordsum all report significance in the Likelihood Ratio test; they do significantly explain the population's attitude towards grass.

Appendix

Appendix 1 – Question 2

Appendix 2 – Question 3

Appendix 3 – Question 4

Appendix 4 – Question 4

Appendix 5 – Question 5

Checking for possible non-linear relations with wordsum as response variable.

A non-linear relationship between realinc and wordsum seems to be the case. A log function shows good explanatory power in JMP.

The relationship between the means of wordsum by age does not indicate any obvious choice of model other than a linear one.

The relationship between the means of wordsum by education does not show any clear pattern that would lead us to transform the variable into anything other than a linear term when performing our regression. A similar conclusion is drawn from this graph as from the previous one.
