Statistic exam

Assignment in Statistics and Research
Methods
Winter 2014/2015
Jeppe Mulvad Jensen
Katrine Bang-Achton
Stefan Jørgensen
BSc in International Business and Politics
Copenhagen Business School
Characters with spaces: 34.046
Word count: 5.552
Pages: 21
Declaration of authorship
Page 2 of 21
Question 1
This question concerns a description of each of the variables given in the data set. Furthermore,
remarkable observations concerning these variables are noted.
The focus of the data set is a semi quantitative vocabulary test (wordsum) and a binary variable
concerning legalization of marijuana (grass). Grass and wordsum are response variables and age,
sex, education, real income, zodiac and class are thus explanatory variables.
We differentiate between categorical and quantitative variables, where categorical variables are
observations that belong to a certain set of categories and quantitative variables are numerical
values. Our categorical variables are: Grass, sex, zodiac and class, while our quantitative variables
are wordsum, age, education and real income.
Furthermore, quantitative variables can be described as being either continuous or discrete.
Discrete variables can be thought of as “counted numbers” or a set of separate numbers (e.g. 1, 2, 3,
4), while continuous variables have a continuum of infinitely many possible values. The discrete
variables in our data set are wordsum, age and education. However, it should be noted that age and
education in theory are continuous variables (e.g. you could be 20.25 years old and have studied for
16.54 years), but in the given data set these variables are counted as discrete. Real income is a
continuous variable.
To investigate whether there is anything remarkable to note, we will look at the distributions of our
variables.
Variable
Distribution and remarkable observations
Grass
The variable is fairly evenly split between illegal and legal: 52,5% are against
legalization and the remaining 47,5% are in favor. However, 740 respondents did not
state whether they are for or against, which might create a non-response bias.
Wordsum
Slightly skewed to the left with a mean of 5,9. The interquartile range (IQR) is 2 (from
5 to 7), so the middle 50% of our observations lie between 5 and 7. The wordsum of 696 of
the 1974 respondents is not given. Once again, this might create a non-response bias.
Age
Age has a mean of 48,2 and is slightly skewed to the right, but with no outliers. The range
is 71 (between 18 and 89).
Sex
55% are female, 45% are male.
Education
Ranges between 0 and 20 with a mean of 13,53. Skewed to the left with an IQR of 4
(between 12 and 16). Respondents with exactly 12 years of education account for 27,38% of the sample.
Real Income
Real income is highly skewed to the right, which makes sense since there is a lower limit
(0) but in theory no upper limit. Real income ranges between 245 and 155140
with a mean of 34295. It is remarkable that there are 144 observations with the
exact income of 155139,973, all of which are marked as outliers. With the given sample size
this is highly unlikely to happen by chance and most likely means that all incomes above
155139,973 are recorded as this exact income.
Zodiac
Evenly distributed
Class
Lower class accounts for 10,2%, working class for 43,6%, middle class for 42,9% and
upper class for 3,3%. Thus, working class and middle class together account for 86,5% of the sample.
Question 2
This question concerns the probability of being pro-legalization, i.e. we are looking at our response variable
“grass”. The question has four parts: First, we state a confidence interval for the probability of being pro-legalization of grass.
Second, we perform a statistical test of the hypothesis that the probability of being pro-legalization is 50%. Third, we compare the probabilities of being pro-legalization between the sexes. Finally, we
conduct a chi-square test to analyse whether zodiac plays a role.
Confidence interval for pro-legalization of grass
A confidence interval reflects the uncertainty about a parameter’s true value: it is constructed by a method
that, with a given level of confidence, captures the true population proportion. We use a 95% confidence level.
As we have at least 15 successes ($n\hat{p} \geq 15$) and at least 15 failures ($n(1 - \hat{p}) \geq 15$), the
sampling distribution of the sample proportion is approximately normal. Although we have a total sample size of
n = 1974, only 1234 respondents answered this question, and we treat these 1234 respondents as our sample. With
586 respondents answering yes, the sample proportion answering yes is $\hat{p} = \frac{586}{1234} = 0,47488$.
The standard deviation of the sampling distribution of $\hat{p}$ is $\sqrt{p(1 - p)/n}$. Since p is unknown, we
estimate it by the standard error (se), calculated as $se = \sqrt{\hat{p}(1 - \hat{p})/n}$. Thus, our confidence
interval for the population proportion in favor of legalizing grass is $\hat{p} \pm 1,96 \cdot se$. Our standard
error becomes $se = \sqrt{0,47488(1 - 0,47488)/1234} = 0,0142156$, and thus our confidence interval becomes
$0,47488 \pm 1,96 \cdot 0,0142156 = (0,447 ; 0,50274)$. This means that we can say with 95% confidence that the
probability of being pro-legalization is between 44,7% and 50,274%. JMP produces a confidence interval that
differs only marginally from ours.
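As a minimal sketch (not part of the JMP output), the hand calculation above can be reproduced in Python. The function name `proportion_ci` is our own illustrative choice, and decimal points replace the decimal commas used in the text.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n                      # sample proportion
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard error
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

# 586 of the 1234 respondents who answered are pro-legalization
p_hat, se, (lower, upper) = proportion_ci(586, 1234)
print(round(p_hat, 5), round(se, 7), round(lower, 3), round(upper, 5))
```

The interval uses the unrounded sample proportion, so the last digits may differ marginally from the values computed by hand.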
Statistical test of the hypothesis that there is 50% probability of being pro-legalization
Assumptions
First, the variable has to be categorical (in our case either for or against legalization). Second, the data are
obtained by randomization; this is an assumption we make throughout the assignment. Third, the sample is large
enough for the sampling distribution of the sample proportion to be approximately normal, i.e. there are at least
15 “successes” (pro-legalization) and at least 15 “failures” (against legalization).
Hypotheses
The null hypothesis is: $H_0: p = 0,5$
The alternative hypothesis is two-sided: $H_a: p \neq 0,5$. We therefore use the two-tail probability, as we are
interested in whether the population proportion falls above or below a certain point, i.e. 0,5.
Test statistic
The test statistic measures how far the sample proportion, $\hat{p}$, falls from the null hypothesized proportion
value $p_0 = 0,5$, presuming the null hypothesis to be true. The test statistic, the z-score, gives us the
number of standard errors between the sample proportion and the null hypothesized proportion value:
$z = \frac{\hat{p} - p_0}{se} = \frac{0,47488 - 0,5}{0,014215555} = -1,76708$.
A z-score of -1,76708 tells us that the sample proportion $\hat{p}$ is 1,76708 standard errors below the null
hypothesis value.
P-value
The P-value states the probability that the test statistic takes a value like the observed test statistic or even
more extreme, presuming the null hypothesis is true. From Table A (see footnote 1) we see that the one-tail
probability for our z-score is 0,0392. As it is a two-sided test, this probability has to be multiplied by 2: 2*(0,0392) = 0,0784.
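The z-score and the two-sided P-value can also be checked numerically. The sketch below is our own illustration: it uses the same standard error as the hand calculation (based on $\hat{p}$) and obtains the normal tail probability from `math.erf` instead of the printed table, so the P-value comes out close to, but not exactly equal to, the table-based 0,0784.

```python
import math

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n = 1234
p_hat = 586 / n
p0 = 0.5                                   # null hypothesized proportion

se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error as used in the text
z = (p_hat - p0) / se
p_two_sided = 2 * normal_cdf(-abs(z))      # both tails

print(round(z, 4), round(p_two_sided, 4))
```

Many textbooks instead compute the standard error under the null hypothesis, $\sqrt{p_0(1 - p_0)/n}$; with a sample proportion this close to 0,5 the two versions give almost the same z-score.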
JMP has applied a chi-square test to this question. JMP
shows a P-value from the chi-square test of 0,0775, which is
sensible, as $z^2 = (-1,767)^2 \approx 3,12$ should be equal to the chi-square statistic. The P-value
means that, if the true proportion were 50%, a sample proportion at least as far from 0,5 as ours
would occur with probability 0,0775. It is not the probability that the proportion differs from 50%.
Footnote 1: Retrieved from Learn, 27-12-2014
Conclusion
With a P-value of 0,0784, which is higher than our significance level of 0,05, we cannot reject our null
hypothesis. Consequently, we cannot reject that the population proportion of being pro-legalization is equal
to the hypothesized value of 50%.
Probabilities of being pro-legalization between males and females
The contingency table shows that 42,79% of the female respondents and 53,25% of the male respondents
have answered yes.
A confidence interval for the difference between the two population proportions can be found by:
$(\hat{p}_1 - \hat{p}_2) \pm z \cdot se$. We choose a 95% confidence level. As we have large random samples,
the central limit theorem tells us that the sampling distribution will be approximately normal, thus the
confidence interval has a margin of error equal to 1,96 standard errors. In order to find the confidence
interval we need the standard error, which is calculated as:
$se = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = \sqrt{\frac{0,4279(1 - 0,4279)}{680} + \frac{0,5325(1 - 0,5325)}{554}} = 0,028449253$.
We can now calculate the confidence interval: $(0,4279 - 0,5325) \pm 1,96 \cdot 0,028449253 = (-0,16036 ; -0,0488)$.
We conclude with 95% confidence that the difference between the proportions of female and male respondents
who answer that grass should be legalized lies between -0,16036 and -0,0488. Thus, we can say with 95%
confidence that the proportion of women being pro-legalization is between 0,0488 and 0,16036 lower than that of
men.
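The interval for the difference between the two proportions can be reproduced with a short sketch. The function name `diff_proportions_ci` is hypothetical; the counts (291 of 680 women and 295 of 554 men answering yes) are the ones used in the odds calculation in this question.

```python
import math

def diff_proportions_ci(x1, n1, x2, n2, z=1.96):
    """Large-sample CI for p1 - p2 from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, se, (diff - z * se, diff + z * se)

# Group 1 = women, group 2 = men
diff, se, (lower, upper) = diff_proportions_ci(291, 680, 295, 554)
print(round(diff, 4), round(se, 6), round(lower, 4), round(upper, 4))
```

Both limits are negative, which is what allows the conclusion that the female proportion is lower than the male one.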
The odds ratio is the ratio of the odds of an event occurring in one group to the odds of the event
occurring in another group. In this case we compare the odds of being pro-legalization for females with
those for males. The odds for women being pro-legalization are $\frac{291/680}{389/680} = 0,74807$ and the
odds for men being pro-legalization are $\frac{295/554}{259/554} = 1,138996$. Consequently, the odds ratio
is $\frac{0,74807}{1,138996} = 0,65678$. The odds ratio in JMP bases the calculation on the non-event, i.e.
on males and females responding negatively to legalization of grass, which gives the same ratio for males
relative to females. We use a 95% CI to assess the precision of the odds ratio. JMP gives:
CI = (0,524096; 0,823059). If the confidence interval contained 1, we could not rule out that females and
males have the same odds. Here it does not: the odds of being against legalization are for males about 0,66
times those for females, i.e. about 34% lower, with the true population odds ratio, with 95% confidence,
lying between 0,524 and 0,823. Consequently, we can say with considerable confidence that males and females
have different attitudes to legalization of grass.
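To illustrate how such an odds ratio and its interval can be computed, the sketch below uses the common large-sample (Wald) interval on the log scale. That JMP uses exactly this method is an assumption on our part, although the limits it produces agree closely with the interval reported by JMP. The function name `odds_ratio_ci` is hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a/b)/(c/d) with a Wald CI computed on the log scale.

    a, b: yes/no counts in group 1; c, d: yes/no counts in group 2.
    """
    odds_ratio = (a / b) / (c / d)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    ci = (math.exp(log_or - z * se_log), math.exp(log_or + z * se_log))
    return odds_ratio, ci

# Women: 291 yes, 389 no. Men: 295 yes, 259 no.
odds_ratio, (lower, upper) = odds_ratio_ci(291, 389, 295, 259)
print(round(odds_ratio, 5), round(lower, 3), round(upper, 3))
```

Swapping both the groups and the event (men against legalization relative to women) gives the same ratio, which is why the hand calculation and the non-event based JMP output coincide.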
Zodiac’s role on grass
To check whether zodiac plays a role means testing for independence between the two variables zodiac and
grass. We will test for independence by conducting a chi-squared test on the contingency table
(appendix 1).
Assumptions
The chi-squared test of independence is based on the following assumptions:
1. Two categorical variables. These are in our case zodiac and grass.
2. Randomization.
3. Expected count ≥ 5 in all cells. The expected cell count is calculated by the following
formula: $\frac{(\text{Row total}) \times (\text{Column total})}{\text{Total sample size}}$. Thus, the expected
cell count in the first cell is $\frac{92 \times 575}{1202} = 44$. Using this formula in all cells of the
contingency table, we see that the third assumption is also satisfied.
All three assumptions are therefore satisfied and we can conduct a chi-squared test. We
choose a significance level of 0,05.
Hypotheses
Our null hypothesis ($H_0$) is that zodiac and grass are independent, i.e. that zodiac does not play a role in
respondents’ attitudes towards grass. Our alternative hypothesis ($H_a$) is that the two categorical variables are
in fact dependent, i.e. zodiac does play a role in respondents’ attitudes towards grass.
Test statistic
To decide whether or not to reject the null hypothesis, we need to find the chi-squared value. The formula for
finding the chi-squared value is: $X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$.
As we saw above, the expected count in the first cell is 44 (44,01 before rounding), thus the chi-squared
value of the first cell is: $\frac{(43 - 44,01)^2}{44,01} = 0,0232$. The chi-squared
value for each cell is also written in the contingency table (appendix 1). Adding together the chi-squared
values of all the 24 cells, we get a final chi-squared value of 7,3938. As we know, the larger the chi-squared
value, the greater is the evidence against $H_0$, i.e. against independence. However, it can be hard to judge
whether a number is large or not, which is why we need to find a P-value in order to decide whether or not
to reject our null hypothesis.
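The cell-by-cell calculation can be sketched as follows. The helper names are our own, and since the full 12 x 2 contingency table is only given in appendix 1, the example only reproduces the first cell; `chi_square_statistic` accepts any table of observed counts given as a list of rows.

```python
def expected_count(row_total, col_total, grand_total):
    """Expected cell count under independence."""
    return row_total * col_total / grand_total

def chi_square_contribution(observed, expected):
    """Contribution of a single cell to the chi-squared statistic."""
    return (observed - expected) ** 2 / expected

def chi_square_statistic(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum(
        chi_square_contribution(obs, expected_count(r, c, total))
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )

# First cell: row total 92, column total 575, sample size 1202, observed 43
expected = expected_count(92, 575, 1202)
first_cell = chi_square_contribution(43, expected)
print(round(expected, 2), round(first_cell, 4))
```

Passing the 12 rows of observed counts from appendix 1 to `chi_square_statistic` should give the total of 7,3938, up to rounding.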
P-value
To find the P-value for the chi-squared statistic, we use the sampling distribution of the chi-squared statistic.
Given the degrees of freedom, the P-value is the right-tail probability above the observed chi-squared value, i.e.
the probability that the chi-squared statistic for a random sample is at least as large as the observed
value when $H_0$ is true. The degrees of freedom are calculated as: df = (rows - 1) x (columns - 1). Thus, df is:
(12 - 1) x (2 - 1) = 11. Given df = 11 and $X^2 = 7,3938$ we can find our P-value by looking at Table C (see
footnote 2). For df = 11 the lowest chi-squared value shown is 13,7 with an associated right-tail probability
of 0,250. Since $X^2 = 7,3938$ falls below this, it has a larger right-tail probability. Thus, the P-value is > 0,25.
Conclusion
With a P-value larger than 0,25 and a significance level of 0,05, we cannot reject our $H_0$ hypothesis. Based on
this sample, we therefore have no evidence of an association between zodiac and
grass.
Question 3
This question concerns a comparison of wordsum between male and female respondents. First, we
give a 95% CI for the difference between the means and then test the difference between the sexes
with a statistical test. Second, we compare wordsum between the different classes.
Comparison of wordsum between female and male respondents
By constructing two histograms of the wordsum results for female and male respondents
(appendix 2), we see that both distributions are approximately symmetric and that the means differ only
slightly: 5,9395 for females and 5,8988 for males. The standard deviation is 2,056 for females
and 2,046 for males (given by JMP). A potential outlier can be defined as an observation falling more
than 3 standard deviations from the mean in a bell-shaped distribution. Thus, observations must fall within
5,9395±(3*2,056) = (-0,23;12,11) for females and 5,8988±(3*2,046) = (-0,24;12,04) for males not to be
outliers. Since our observations can only fall between 0 and 10, we have no outliers.
When calculating a confidence interval for the difference between two means, we first need to calculate the
standard error. The standard error expresses how well the difference between the sample means
$(\bar{x}_1 - \bar{x}_2)$ estimates the difference between the population means. The equation is:
$se = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$. As can be seen in the JMP output, the standard error
equals 0,11517. For a 95% confidence interval the margin of error is approximately two standard errors, so the
interval for the difference between the population means is: $(\bar{x}_1 - \bar{x}_2) \pm 2(se)$. With the
respective means of 5,9395 (women) and 5,8988 (men), the confidence interval for the difference (men minus
women) becomes: (-0,26672 ; 0,18518). As this confidence interval contains the value 0, we cannot claim that
there is a significant difference between male and female respondents. When 0 falls in the interval it is
possible that $\mu_1 = \mu_2$. In the following we conduct a t-test to investigate whether this is true.
Footnote 2: Retrieved from Learn, 27-12-2014
Assumptions
First, a quantitative response variable for two groups. This is in our case wordsum. Second, independent
random samples. Third, an approximately normal population distribution for each group.
Hypothesis
$H_0$: $\mu_1 = \mu_2$, i.e. the two means are the same.
$H_a$: $\mu_1 \neq \mu_2$, i.e. the two means are not the same.
Test-statistic
We conduct a significance test comparing the two means. The test gives us a t-statistic with an associated
p-value, with which we will be able to reject or not reject the null hypothesis. We use the following equation:
$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{se}$.
Given our means and our standard error, t equals:
$t = \frac{(5,8988 - 5,9396) - 0}{0,11517} = -0,354$.
P-value
Given the t-value of -0,354 and df equal to 1240 (given by JMP) we get a two-sided p-value of 0,7234 by
using software.
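As a rough check of the t-statistic and P-value, the sketch below recomputes t from the reported means and standard error. Because df = 1240 is large, it approximates the t distribution by the standard normal distribution; this approximation is our own assumption and explains any small deviation from the P-value reported by JMP.

```python
import math

mean_male, mean_female = 5.8988, 5.9396
se = 0.11517                              # standard error reported by JMP

# Test statistic for H0: mu_1 = mu_2
t = ((mean_male - mean_female) - 0) / se

# Two-sided P-value, normal approximation to the t distribution (df = 1240)
p_approx = math.erfc(abs(t) / math.sqrt(2))

print(round(t, 3), round(p_approx, 3))
```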
Conclusion
As we know, when the p-value is higher than the significance level (in our case 0,05), we cannot reject our
null-hypothesis, thus we cannot reject that the two means are the same.
Comparing wordsum between the different classes
As class is a categorical variable and wordsum is a quantitative
variable, we can compare the two with an analysis of variance (ANOVA), which compares the means of
each group. With an ANOVA we can test for differences between the mean wordsum in each class, i.e. see whether
some classes score higher on the vocabulary test than others. ANOVA tests all the means simultaneously, using
both the variance within and the variance between the classes.
As we are only comparing wordsum across class, this is a one-way ANOVA with 4 levels (coinciding with the 4
categories in the factor), which is why we are testing 4 means (one from each class).
Assumptions: First, a normal population distribution in each group. Second, equal standard deviations in the
populations. Third, randomization.
Hypotheses: To test for differences between the means, our null hypothesis states that the means are
all equal, so that the differences $\alpha_i$ between each group mean and the overall mean are all zero.
$H_0$: $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$.
$H_1$: At least one $\alpha_i$ differs from zero, i.e. not all group means are equal.
Test Statistic: The F test uses the F distribution, in which all values are positive and which is
positively skewed.
The total variability is the sum of squared deviations from the grand mean and can be written as:
$SS_{tot} = \sum_{i,j}(x_{ij} - \bar{\bar{x}})^2 = \sum_i n_i(\bar{x}_i - \bar{\bar{x}})^2 + \sum_{i,j}(x_{ij} - \bar{x}_i)^2 = SS_{treat} + SS_{error}$,
where $\bar{\bar{x}}$ = grand mean = 5,9337, $\bar{x}_i$ = group mean, $(\bar{x}_i - \bar{\bar{x}})$ = variability
between groups and lastly $(x_{ij} - \bar{x}_i)$ = variability within groups.
To find the F ratio we divide the between-groups mean square by the within-groups mean square:
Between-group: $MSE_{treat} = \frac{SS_{treat}}{r - 1}$ and within-group: $MSE_{error} = \frac{SS_{error}}{n - r}$
$F = \frac{MSE_{treat}}{MSE_{error}} \sim F_{r-1,\, n-r}$
$SS_{treat} = n_A(\bar{x}_A - \bar{\bar{x}})^2 + \dots + n_D(\bar{x}_D - \bar{\bar{x}})^2$
$= 135 \cdot (5,22963 - 5,933702)^2 + \dots + 40 \cdot (6,924 - 5,933702)^2$
$= 268,3687$
$SS_{error} = 4985,0622$
$F = \frac{SS_{treat}/(r - 1)}{SS_{error}/(n - r)} = \frac{268,3687/(4 - 1)}{4985,0622/(1267 - 4)} = 22,664353$
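The last step of the hand calculation can be verified with a few lines of Python. The function name `f_ratio` is hypothetical, and the sums of squares are taken as given above rather than recomputed from the raw data.

```python
def f_ratio(ss_treat, ss_error, groups, n):
    """One-way ANOVA F ratio from the between- and within-group sums of squares."""
    ms_treat = ss_treat / (groups - 1)   # between-groups mean square, df = r - 1
    ms_error = ss_error / (n - groups)   # within-groups mean square, df = n - r
    return ms_treat / ms_error

# r = 4 classes and n = 1267 respondents
f = f_ratio(268.3687, 4985.0622, groups=4, n=1267)
print(round(f, 4))
```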
We expect JMP's results to differ slightly from ours,
as the software generally carries more precision through
its calculations than our hand calculation does.
The F ratio in JMP is equal to 22,4110 (see
table).
P-value: An F ratio of 22,4110 lies far out in the right tail of the F
distribution. In addition, the P-value of 0,0001 (given by JMP) shows that our
test is significant, and we can reject our null hypothesis.
Conclusion: As we can reject our null hypothesis, there is a difference between at least two
group means. Consequently, the level of vocabulary differs between classes. The clear
difference in wordsum between classes might, however, be explained better by education than by
class itself: using class as an explanatory variable might very well capture the effect of
education as well. We see a clear association between the wordsum test and class, but as usual we cannot
infer causality and claim that class is the reason for the difference.
Question 4
In this question we perform a linear regression where wordsum is described by age. Furthermore, we
compute a 95% confidence interval for the slope of the regression and consider if a quadratic model fits the
data better.
Linear regression (wordsum described by age) and confidence interval
Performing a linear regression in JMP, the regression equation is y = b + ax, where
wordsum is the response variable (y) and age is the explanatory variable (x). In this case:
wordsum = intercept + slope*age = 5,43478 + 0,01025*age
(see graphical illustration in Appendix 3). First, we notice that the P-value of 0,0016 is far below the
significance level of 5%, meaning that the regression model is statistically significant. Turning to the
parameter estimates, it becomes clear that both the intercept and the slope are also statistically significant.
The intercept should be interpreted as the expected wordsum for a person aged 0. Since the minimum age in
the survey is 18, this is not a result from which we can make any inference about the population. The slope of
0,01 means that for each additional year of age a respondent is expected to answer 0,01 more questions
correctly. Age is related to the result of the wordsum test, but the effect is small, as you would have to
age about 100 years to get one more question correct. As the result of the wordsum test is a discrete variable
taking a value from 0 to 10, the effect of age is only visible on average and does
not make for any noticeable difference at the individual level. Another
important part of the analysis of this regression is the very low Rsquare, meaning that age is not a
major determinant. With an Rsquare of 0,007, age explains
approximately 0,7% of the variation in wordsum.
Moving on to the CI, we see that a 95% CI for the slope of the regression is (0,0038958 ; 0,0166035). This
reflects the small effect of age on the number of correct words in the wordsum test.
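To make the size of the effect concrete, the fitted line can be evaluated at the youngest and oldest ages in the sample (18 and 89). This is a small sketch using the rounded coefficients quoted above; the function name `predicted_wordsum` is our own.

```python
INTERCEPT = 5.43478   # estimated intercept from JMP
SLOPE = 0.01025       # estimated change in wordsum per year of age

def predicted_wordsum(age):
    """Predicted wordsum from the simple linear regression on age."""
    return INTERCEPT + SLOPE * age

youngest = predicted_wordsum(18)
oldest = predicted_wordsum(89)
years_per_extra_word = 1 / SLOPE   # years of ageing per additional correct answer

print(round(youngest, 3), round(oldest, 3), round(years_per_extra_word, 1))
```

Across the whole observed age range the predicted score thus changes by less than one word.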
To conclude we cannot describe age as having a powerful effect on the wordsum even though it is
statistically significant.
Quadratic model
The quadratic model can be seen in appendix 4. Fitting a quadratic model for wordsum described by age
seems to generate a better fit with the data, i.e. produce smaller residuals. The equation for the quadratic
model is: wordsum = 5,4840382 + 0,0122259*age - 0,0004581*(age-48,1935)^2. Comparing the quadratic
model with the linear regression, we see that the Rsquare is
larger for the polynomial than for the linear regression model.
The Rsquare of the polynomial is 0,013 and thus not far from
twice as large as for the linear model.
Taking the same approach as for the linear
regression, we see that the quadratic model is
statistically significant, as are all its
parameters at our significance level.
The individual P-values of the parameters are very small
and not far from each other in both models. Therefore we do not draw any conclusion about which model
describes the data better on the basis of the P-values. The main difference between the two models is
that the linear model assumes a constant marginal effect of the explanatory variable, i.e. age, whereas the
quadratic model allows for different effects at different levels of the explanatory variable. In this particular
case the negative coefficient on the squared term tells us that the positive effect of age diminishes with age:
the fitted curve peaks at about age 61,5 (solving 0,0122259 - 2*0,0004581*(age - 48,1935) = 0), after which
the expected result in the wordsum test declines. Note that 48,1935, approximately the mean age, is merely the
value around which the squared term is centered, not the turning point. In conclusion, the quadratic model does
seem to produce a better fit, as the Rsquare is greater and the coefficients show significance.
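The age at which the fitted quadratic curve turns can be found by setting its derivative to zero. The sketch below does this for the coefficients quoted above; the variable names are our own.

```python
B0 = 5.4840382      # intercept
B1 = 0.0122259      # coefficient on age
B2 = -0.0004581     # coefficient on the centered squared term
CENTER = 48.1935    # value around which the squared term is centered

def quadratic_wordsum(age):
    """Predicted wordsum from the quadratic model."""
    return B0 + B1 * age + B2 * (age - CENTER) ** 2

# d/d(age) = B1 + 2 * B2 * (age - CENTER) = 0
turning_age = CENTER - B1 / (2 * B2)

print(round(turning_age, 2), round(quadratic_wordsum(turning_age), 3))
```

Before the turning point the predicted wordsum rises with age, and after it the prediction falls.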
Question 5
In this question we conduct a multiple linear regression analysis by adding sex, education, class and
real income to the regression model to investigate to what extent they help explain wordsum.
a) Additive model
Y = 1,4145 + 0,01279*age + 0,02672*sex(female) +
0,28092*education + (certain)class + 4,0441*realinc.
Analysing the output from the model in JMP, we can conclude from the F-test that at least one of the
variables is statistically significant. We note that the variable “sex” and all the variables
related to class are statistically insignificant. The
coefficient “intercept” is statistically significant, but its
interpretation makes it less meaningful.
As we have used categorical explanatory variables, the
intercept is interpreted as the expected number of correct answers
in the wordsum test for an upper-class male aged 0 with 0 years of education
and 0 real income. As this type of person is not observed in the data, nor does the existence of such a
person seem probable, we cannot use this coefficient to draw any inferences.
The Rsquare of the model is just above 23%, which means that the model explains approximately
23% of the variation in wordsum.
b) Assumptions, nonlinear effects/ interactions and modification of the model
In JMP the method of ordinary least squares (OLS) is used to
make the sum of squared residuals as small as possible, thus finding the
best-fitting regression. One assumption in using OLS is that the residuals are
approximately normally distributed. To make sure that the assumptions of
the linear regression are not violated, we plotted the residuals and
calculated the skewness factor. In this case the residuals have a
skewness factor of -0,252, which is sufficiently
close to 0 to claim that the residuals are approximately normally
distributed. In addition, OLS requires that: 1) the error term has a population mean of 0, 2) the errors are
uncorrelated and 3) the error term has constant variance. We also check that all explanatory variables are
uncorrelated with the error term and that no explanatory variable is a perfectly linear function of another
(appendix 5). As all these assumptions are fulfilled, we confirm the results of the model.
Figure 1: Residuals of wordsum, JMP
Possible nonlinear effects
When real income is logarithmically transformed, the Rsquare for its
effect on wordsum rises from 0,066 to 0,111, i.e. the transformation
provides a better fit. Graphing the mean wordsum against real income
also shows a shape similar to a logarithmic
function. None of the other variables shows any tendency towards strong
nonlinear effects. This has been checked by graphing the individual
explanatory variables against the wordsum test (appendix 5).
Intuitively, not many interactions seem likely to be present. When searching for interactions, it might at
first seem sensible to check for an interaction between real income and class, as these two variables seem to
describe nearly the same thing. However, this is misleading: as both variables seem to be a measure of
income, we would not expect a change in real income to affect the impact that class
has on the wordsum test.
Of all the possible interactions we have tested for, none seems to be even slightly significant. The
coefficients of the interaction terms have all been very close to 0, and the P-values of the corresponding
t-tests have all been well above our significance level. Interactions are tested by multiplying two variables and
treating the product as a variable in its own right, while still
including the original two variables in the regression
model.
As none of the interaction terms we tested proved to
be statistically significant, our model now takes into
account: age, log of real income, class and education.
We have removed the variable sex, as it failed to provide any
predictive value for the wordsum test. Even though we earlier
established that age has a higher predictive value
when fitted as a quadratic polynomial, we include it
linearly in our final model for simplicity, although a squared
age term could in principle be added to a multiple linear
regression.
c) Analysing the statistical significance of the predictors
The predictors age, education and real income all perform very well in terms of
statistical significance. Class, on the other hand, now performs worse than in the previous model, and all the
different class categories are now insignificant. Thus, we cannot reject the hypothesis that their
coefficients are 0.
In a simplification of the model we eliminate insignificant
predictors, i.e. class and sex (the latter was removed just
prior to this subsection). Consequently, our preferred model
for predicting wordsum with the information given includes
the predictors age, education and log of real income.
From JMP we now have an F ratio equal to 124,5 and an Rsquare
equal to 0,2456. This reflects a relatively good model for predicting
wordsum. The model before simplification
actually had an Rsquare 0,0009 higher than the
present one. However, the present model has a much higher F ratio, and all predictors are significant with
entirely positive CIs (a positive relation between each predictor and wordsum). Given our analysis, the current
model is therefore the best. The decrease in Rsquare is very small and thus a small price for increasing the
significance of the explanatory variables (remember that we removed class) and the overall F ratio. Conversely,
one could say that the slightly higher Rsquare in the first model is due to the inclusion of class, but as this
is an insignificant variable we do not wish to use it to artificially increase our Rsquare.
A 95% CI for the effect of education is given in JMP: (0,2278 ; 0,3036). Thus we can say with 95% confidence
that each extra year of education increases the predicted number of correct answers in the wordsum test by
between 0,2278 and 0,3036, holding the other variables constant.
Question 6
a) Logistic regression model predicting grass from
education
Logistic regression is used when our response variable is
categorical. As we only have two categories in the
response variable, it can be defined as a dummy variable
with the binary outcome coded as 0 (against
legalization) or 1 (for legalization).
The population mean is equal to the population proportion
of people being for legalization, $\mu_y = p$. It always lies
between 0 and 1, because the numerator is smaller than the denominator
in the logistic regression equation: $p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$.
Consequently, the regression curve is S-shaped rather than a straight line, which is more realistic for a
probability. β does not represent a constant slope, but determines the steepness of the curve: as |β| increases,
the curve becomes steeper.
Intercept = α = -1.1410444 and β = 0.07687169
The minimum value of our explanatory variable, years of education, is 0, for which $\hat{p}$ = 0,24213 (the
probability of being pro-legalization with 0 years of education). The maximum value of years of education is 20,
for which $\hat{p}$ = 0,59782. The probability of being pro-legalization thus increases with education.
When a person has 14.8 years of education, being pro and being against legalization are equally
likely (p = 0.5 at x = -α/β = 14.8434931).
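The predicted probabilities follow directly from the logistic equation and the two estimates. The sketch below evaluates the curve at 0 and 20 years of education and at the point where p = 0.5; the function name `prob_pro` is our own.

```python
import math

ALPHA = -1.1410444    # intercept estimated by JMP
BETA = 0.07687169     # coefficient on years of education

def prob_pro(education_years):
    """Predicted probability of being pro-legalization."""
    linear = ALPHA + BETA * education_years
    return math.exp(linear) / (1 + math.exp(linear))

p_min = prob_pro(0)            # no education
p_max = prob_pro(20)           # maximum education in the data
even_point = -ALPHA / BETA     # years of education at which p = 0.5

print(round(p_min, 5), round(p_max, 5), round(even_point, 4))
```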
The odds ratio represents the odds that an outcome
(pro-legalization) will occur given education,
compared to the odds of the outcome occurring in the
absence of education (or with a year less of education).

Probability (p)    0 yrs of education    1 yr of education
Pro                0.24213               0.256513
Against            0.75787               0.743487

Odds p/(1-p)       0 yrs of education    1 yr of education
Pro                0.3194875             0.345013162
Against            3.1300128             2.89843789

Odds ratio for being pro-legalization given 1 year of
education = 0.345013162/0.3194875 = 1.079896.
Odds ratio for being against legalization given 1 year of
education = 2.89843789/3.1300128 = 0.9260.
The unit odds ratio equals 1.079904 (given by JMP), which means that for each additional year of
education the odds of a person being pro-legalization of grass are multiplied by 1.0799.
As the unit odds ratio is relatively close to 1, education does not appear to be a strong
explanatory factor for the observed persons’ attitudes towards grass.
JMP gives a 95% CI for the odds ratio of (1.041519;
1.120413). This is a relatively narrow CI, which indicates
high precision in our odds ratio. Moreover, as the
interval does not contain 1, we can say with 95% confidence that education has a positive effect on being
pro-legalization of grass.
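The unit odds ratio is simply $e^{\beta}$, which can be checked against both the JMP value and the odds computed from the rounded probabilities. This is a small sketch; the variable names are our own.

```python
import math

BETA = 0.07687169                 # coefficient on education from JMP

# Multiplicative change in the odds per extra year of education
unit_odds_ratio = math.exp(BETA)

# The same ratio from the rounded probabilities at 0 and 1 year of education
odds_0 = 0.24213 / (1 - 0.24213)
odds_1 = 0.256513 / (1 - 0.256513)
ratio_pro = odds_1 / odds_0       # odds ratio for being pro-legalization
ratio_against = odds_0 / odds_1   # reciprocal: odds ratio for being against

print(round(unit_odds_ratio, 6), round(ratio_pro, 4), round(ratio_against, 4))
```

The odds ratio for being against legalization is the reciprocal of the one for being pro-legalization, so it lies just as close to 1 on the other side.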
To evaluate the predictive power of the whole model we investigate the sensitivity and specificity
measures, as R² is not compatible with logistic regression. Sensitivity is the ability to predict an event
correctly; specificity is the ability to predict a non-event correctly.
Sensitivity is calculated by dividing the correctly predicted events by the observed events:
225/(225+361) = 0.384 or 38%. Specificity is calculated by dividing the correctly predicted non-events by
the observed non-events: 435/(213+435) = 0.671 or 67%. Furthermore, 213/(225+213) = 49% of predicted
events are false positives and 361/(361+435) = 45% of predicted non-events are false negatives.
Consequently, the sensitivity and specificity imply that we should be cautious in using the logistic
regression model to generalize its predictions to the population.
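The four measures can be reproduced from the counts used in the calculation above. This is a minimal Python sketch; the variable names are our own, and the counts are taken from the fractions in the text.

```python
# Counts read from the classification fractions in the text.
true_positives = 225    # predicted pro, observed pro
false_negatives = 361   # predicted against, observed pro
false_positives = 213   # predicted pro, observed against
true_negatives = 435    # predicted against, observed against

# Share of observed events that the model predicts correctly.
sensitivity = true_positives / (true_positives + false_negatives)

# Share of observed non-events that the model predicts correctly.
specificity = true_negatives / (false_positives + true_negatives)

# Share of predicted events that are actually non-events.
false_positive_share = false_positives / (true_positives + false_positives)

# Share of predicted non-events that are actually events.
false_negative_share = false_negatives / (false_negatives + true_negatives)

for name, value in [
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("false positives among predicted events", false_positive_share),
    ("false negatives among predicted non-events", false_negative_share),
]:
    print(f"{name}: {value:.3f}")
```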
(b) Multiple logistic regression
We now extend the model to a multiple logistic regression that adds the variables age, class, sex and
wordsum as predictors of grass. The parameter estimates can be seen in the JMP output.
The formula for the multiple logistic regression prediction equation expands to:
\[
\hat{p} = \frac{e^{-0.54 + 0.0073 \cdot \text{educ} - 0.019 \cdot \text{age} + \beta_{\text{class}} - 0.288 \cdot \text{sex(Female)} + 0.2442 \cdot \text{wordsum}}}{1 + e^{-0.54 + 0.0073 \cdot \text{educ} - 0.019 \cdot \text{age} + \beta_{\text{class}} - 0.288 \cdot \text{sex(Female)} + 0.2442 \cdot \text{wordsum}}}
\]
where \beta_{\text{class}} denotes the parameter estimate for the person's class category.
The positive parameter estimates, such as those for education, lower class and wordsum, imply a positive
relationship with attitude towards grass: the estimated probability of being pro-legalization of grass is
higher with more education or the like. Conversely, the negative coefficients imply a negative effect on the
attitude towards grass, with a lower estimated probability of being pro-legalization.
The P-values for age, middle class, sex and wordsum are all below our significance level, which gives
strong evidence that the corresponding population effects on grass are non-zero. However, education and
the class categories other than middle class (lower class among them) have P-values above our significance
level, which indicates that they are not explanatory when predicting grass.
(c) Statistical tests of significance for each predictor
JMP reports Likelihood Ratio tests with chi-square statistics to check the significance of the different
coefficients. In both the parameter estimates and the Likelihood Ratio tests, age, sex and wordsum report
P-values less than 0.05, showing significance for all three parameters. Consequently, we can reject the null
hypothesis for these parameters and conclude that the attitude to legalization of grass does depend on them.
In the parameter estimates for class, however, only middle class shows significance with a P-value less than
0.05; here the null hypothesis can be rejected, as there is a difference in attitude between upper and middle
class. In line with one of the 4 categories in class showing significance, the effect Likelihood Ratio test also
reports overall significance for class; we should therefore not refit the multiple logistic regression without
this parameter. The effect Likelihood Ratio test for education shows a very high P-value, much greater than
our significance level of 0.05. Consequently, this parameter should be removed in a modification of the
model, as it has no significant predictive power for the population's attitude towards grass.
Without education, the explanatory variables age, class, sex and wordsum all report significance in the
Likelihood Ratio test; they do significantly explain the population's attitude towards grass.
Appendix
Appendix 1 – Question 2
Appendix 2 – Question 3
Appendix 3 – Question 4
Appendix 4 – Question 4
Appendix 5 – Question 5
Checking for possible non-linear relations with wordsum as response variable.
A non-linear relationship between realinc and wordsum seems to be the case. A log function shows good
explanatory power in JMP.
The relationship between the means of wordsum corresponding to age does not indicate any obvious choice
of model other than a linear one.
The relationship between the means of wordsum corresponding to education does not show any clear pattern
that would lead us to transform the variable rather than enter it as a linear term when performing our
regression.
A similar conclusion is drawn from this graph as from the previous.