Assignment in Statistics and Research Methods

Indian Retail Stores
Winter 2013/2014
Kirsten Marie Simonsen – XXXXXX-XXXX
Kia Slæbæk Jensen – XXXXXX-XXXX
Jasmin Sharzad – XXXXXX-XXXX
Malene Louise Thomasen – XXXXXX-XXXX
CTU: 25,230
BSc IBP
Statistics and Research Methods
Winter 13/14
Table of Contents
Question 1
Question 2
    Two Group Comparison
    Three Group Comparison
Question 3
    Two Group Comparison
    Three Group Comparison
Question 4
Question 5
Question 6
    a) Additive Model
    b) Fulfillment of Model Assumptions
    c) Statistical Significance of Predictors
Question 7
    a) Logistic Regression Model
    b) Multiple Regression Model
    c) Significance: Perception
Question 1
This question concerns a description of the variables in the data set. Furthermore, the minimum and
maximum store size will be calculated in both square feet and square meters.
Variables are the characteristics observed in a study. Thus, the variables in the data file from the research
papers: “Competition and labor productivity in India’s retail stores” and “Are labor regulations driving
computer usage in India’s retail stores?” are LogSize, Competition, Perception, Efficiency3yr, Efficiency,
Logsales3yr, Logsales, Store type, City and Computer use.
A variable can be either quantitative or categorical. The quantitative variables in the data set are LogSize, Competition, Perception, Efficiency3yr, Efficiency, Logsales3yr and Logsales, since they can take any value within a certain interval; that is, they are continuous. LogSize, Logsales3yr and Logsales are expressed as base-10 logarithms in order to reduce the spread of the data.
The categorical variables in the data set are Store type, City and Computer use, as each observation belongs to one of a set of categories.
Graphs and numerical summaries are used to describe the main features of both quantitative and categorical variables. For quantitative variables, the key features to describe are the center and the variability, which is why a histogram is usually used to describe the data. In contrast, for categorical variables, a key feature to describe is the relative number of observations in the various categories. Consequently, a bar graph is commonly used to describe the data.
Minimum and maximum
We will now describe the minimum and maximum of the logSize variable. The store size is first calculated in square feet and then converted into square meters.
By constructing a histogram in SAS JMP (see appendix 1), we get the maximum and minimum values of the logSize variable. Since the data is given as a base-10 logarithm, $x = \log_{10} y$, we use the inverse $y = 10^x$ to obtain the minimum and maximum values in square feet:
Min: $10^{1.079} = 11.995$ sq.ft.
Max: $10^{4.176} = 14996.848$ sq.ft.
These values can be converted into square meters by using the following formula:
$$\text{m}^2 = \frac{\text{ft}^2}{10.764}$$
So by inserting the above results into the formula we obtain the following results:
Min: $\frac{11.995}{10.764} = 1.114$ sq.m
Max: $\frac{14996.848}{10.764} = 1393.241$ sq.m
Thus it can be concluded that there is a wide spread in the LogSize variable, meaning that there is a large
difference in the store sizes.
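The back-transformation and unit conversion above can be sketched in a few lines of Python. This is only an illustrative check of the arithmetic; the function name `log_size_to_area` and the constant name are our own and not part of the SAS JMP output.

```python
# Conversion factor used in the text: 1 square meter = 10.764 square feet.
SQFT_PER_SQM = 10.764


def log_size_to_area(log_size):
    """Turn a base-10 log store size into (square feet, square meters)."""
    sqft = 10 ** log_size          # invert x = log10(y)
    sqm = sqft / SQFT_PER_SQM      # m^2 = ft^2 / 10.764
    return sqft, sqm


# Minimum and maximum logSize values read from the histogram (appendix 1).
min_sqft, min_sqm = log_size_to_area(1.079)
max_sqft, max_sqm = log_size_to_area(4.176)

print(f"min: {min_sqft:.3f} sq.ft. = {min_sqm:.3f} sq.m")
print(f"max: {max_sqft:.3f} sq.ft. = {max_sqm:.3f} sq.m")
```

Working on the log scale and only converting at the end keeps the rounding in one place.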
Question 2
This question regards a comparison of the store types in terms of their level of efficiency.
We will begin with constructing a significance test and a confidence interval for the two group comparison,
followed by a significance test and a confidence interval for the three group comparison.
Two Group Comparison
Significance Test
1. Assumptions
Since we are comparing means, the first assumption of the significance test is that the response variable
has to be quantitative, which is the case of the efficiency variable.
The second assumption suggests that the sample must be collected using randomization. The sampling
methodology for Enterprise Surveys is stratified random sampling. In a stratified random sample, all
population units are grouped within homogeneous groups and simple random samples are selected within
each group.
The third assumption states that each group must have an approximately normal distribution. By looking at
the histograms (see appendix 2) for each group, it can be observed that this is the case of both store types.
2. Hypotheses
The null hypothesis assumes that the two means are equal, and thus there is no difference in the level of
efficiency: $H_0: \mu_1 = \mu_2$
The two-sided alternative hypothesis suggests that the means are different from one another, thereby
implying an association between efficiency and store type: $H_a: \mu_1 \neq \mu_2$
3. Test Statistic
The test statistic measures the distance between the point estimate (the difference between the sample means) and the null hypothesis value, expressed in number of standard errors. The test statistic has
approximately a t distribution if H0 is true.
The following formula is used to construct the test statistic:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{se}$$
The standard error can be calculated by using the following formula:
$$se = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
The standard error is then derived to be:
$$se = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{0.450465^2}{68} + \frac{0.45865^2}{278}} = 0.061161988 \approx 0.06$$
And the t test statistic is calculated to be:
$$t = \frac{(5.7624265 - 5.3656043) - 0}{0.061161988} = 6.488052677 \approx 6.488$$
4. P-Value
The p-value is the probability that the test statistic takes the observed value or a more extreme value if H0 is true. In this case the two-tail probability from the t distribution is used, and it has to be smaller than the significance level of α=0.05 if H0 is to be rejected. We use table B on page A-3. Since df is larger than 100, we use the last row of the table (df = infinity). With a t test statistic of 6.488, the right-tail probability is far smaller than 0.001, which can be seen in table B. The two-sided P-value must therefore be smaller than 0.002.
5. Conclusion
Seeing as the p-value is smaller than the significance level of 0.05 we can reject H0. Therefore we can
conclude that there is a difference in efficiency between the two store types.
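The five steps above can be retraced from the summary statistics alone. The sketch below is a minimal illustration in Python, not the SAS JMP procedure. The function names are our own, and the p-value uses the standard normal approximation (the df = infinity row of table B) rather than an exact t distribution.

```python
import math


def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """Standard error and t statistic for H0: mu1 = mu2 (unpooled variances)."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t = ((mean1 - mean2) - 0) / se
    return se, t


def two_sided_p_normal(statistic):
    """Two-tail probability under the standard normal (large-df approximation)."""
    return math.erfc(abs(statistic) / math.sqrt(2))


# Consumer Durable Stores (n = 68) vs. Traditional FMCGs (n = 278), appendix 2.
se_t, t_stat = two_sample_t(5.7624265, 0.450465, 68, 5.3656043, 0.45865, 278)
p_t = two_sided_p_normal(t_stat)

print(f"se = {se_t:.6f}, t = {t_stat:.3f}, two-sided p = {p_t:.2e}")
```

The normal approximation is reasonable here only because the degrees of freedom are large; for small samples an exact t distribution would be needed.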
Confidence Interval
Since the response variable is quantitative, we are comparing means. Using SAS JMP (see appendix 2) to
obtain the means, we have constructed a confidence interval for the difference in efficiency between the
means of the store types Traditional and Consumer Durables.
The difference in means will be calculated by using the following formula for constructing a confidence
interval:
$$(\bar{x}_1 - \bar{x}_2) \pm t_{0.025} \cdot se$$
For the t-score $t_{0.025}$ we use df = (68+278)-1 = 345. Since the degrees of freedom exceed
100 we use $t_{0.025}$ = 1.96.
Where the standard error (se) is calculated to be:
$$se = \sqrt{se(\bar{x}_1)^2 + se(\bar{x}_2)^2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{0.450465^2}{68} + \frac{0.45865^2}{278}} = 0.061161988 \approx 0.06$$
The confidence interval is then:
$$(5.7624265 - 5.3656043) \pm 1.96 \cdot 0.06 = 0.3968222 \pm 0.1176 = (0.2792222; 0.5144222) \approx (0.279; 0.514)$$
Thus we can be 95% confident that the population mean difference for efficiency (μ1 – μ2) between
Consumer Durable Stores and Traditional FMCGs falls between 0.279 and 0.514. We can then infer that the
efficiency in the Consumer Durable Stores is between 0.279 and 0.514 points larger than the efficiency in
the Traditional FMCGs measured on a scale from 1 to 4.
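As a rough check of the interval, the calculation can be repeated in Python. The helper name `mean_difference_ci` is hypothetical. The sketch computes the interval twice: once with the standard error rounded to 0.06, as in the hand calculation above, and once with the unrounded value, which gives slightly different bounds.

```python
import math


def mean_difference_ci(diff, se, critical=1.96):
    """Confidence interval diff +/- critical * se."""
    margin = critical * se
    return diff - margin, diff + margin


diff_means = 5.7624265 - 5.3656043
se_exact = math.sqrt(0.450465 ** 2 / 68 + 0.45865 ** 2 / 278)

ci_rounded = mean_difference_ci(diff_means, 0.06)     # se rounded as in the text
ci_exact = mean_difference_ci(diff_means, se_exact)   # se kept unrounded

print("rounded se:  ", ci_rounded)
print("unrounded se:", ci_exact)
```

Both intervals lie well above zero, so the conclusion does not depend on the rounding.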
Three Group Comparison
Significance Test
When comparing several means, the analysis of variance method is used by constructing a one-way ANOVA
test (see appendix 3). It tests whether efficiency is independent of store type across the three store types.
1. Assumptions
The first assumption states that the distributions for the groups are normal, with the same standard deviation for each group. The standard deviations are not completely identical, with a difference from
the largest to the smallest of 0.064, but since the largest standard deviation is less than twice the smallest, the general formula can be
used for calculation. Additionally, the second assumption is randomization, which is fulfilled (see two
group comparison).
2. Hypotheses
The null hypothesis suggests that the means for each group are equal, thus there is no difference in the
level of efficiency between the store types: $H_0: \mu_1 = \mu_2 = \dots = \mu_g$
The alternative hypothesis states that at least two of the means are unequal, thus suggesting an
association between efficiency and store type.
3. Test Statistic
The ANOVA test has an F distribution, with two degrees of freedom values, which has a mean equal to
approximately 1, when H0 is true.
F test statistic:
$$F = \frac{\text{Between-groups variability}}{\text{Within-groups variability}} = \frac{5.93626}{0.20309} = 29.2303 \approx 29.23$$
4. P-Value
The p-value is the right-tail probability from the F distribution with
$df_1 = g - 1 = 3 - 1 = 2$
$df_2 = n - g = 388 - 3 = 385$
Using Table D on page A-5 we find that an F test statistic of about 3 or above corresponds to a right-tail
probability of 0.05 or smaller. Since the F test statistic of 29.23 is far larger than 3,
the P-value must be much smaller than the significance level 0.05.
5. Conclusion
JMP reports the P-value to be smaller than 0.001, therefore we can reject the null hypothesis. Thus it can
be concluded that at least two of the groups have different means.
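To make the structure of the F statistic concrete, the sketch below computes a one-way ANOVA from raw observations in plain Python. The three small groups are made-up toy numbers used only to show the mechanics, since the raw store data is not reproduced in this text. The last line repeats the ratio of the two mean squares reported by JMP.

```python
def one_way_anova(groups):
    """Return (F, df1, df2) for a list of groups of observations."""
    n_total = sum(len(g) for g in groups)
    g_count = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-groups sum of squares: group size times squared deviation of its mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-groups sum of squares: deviations from each group's own mean.
    ss_within = sum((y - m) ** 2
                    for g, m in zip(groups, group_means) for y in g)

    df1 = g_count - 1
    df2 = n_total - g_count
    f_stat = (ss_between / df1) / (ss_within / df2)
    return f_stat, df1, df2


# Hypothetical toy data, NOT the store data.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_toy, df1_toy, df2_toy = one_way_anova(toy_groups)

# Ratio of the mean squares reported by JMP (appendix 3).
f_reported = 5.93626 / 0.20309

print(f_toy, df1_toy, df2_toy)
print(round(f_reported, 2))
```

A large F arises when the group means differ by much more than the spread within the groups would suggest.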
Confidence Interval
The confidence interval is used to estimate the differences between population means. The following
formula is used to construct a 95% confidence interval for the difference between the means of two groups:
$$\bar{y}_i - \bar{y}_j \pm t_{0.025} \, s \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$$
The t score has df = N – g.
We will use the formula to obtain the confidence interval for the difference in the means for the three
store types.
Consumer Durable Stores (CDS) vs. Modern Format Stores (MFS)
The common standard deviation is obtained from the Mean Square Error reported by JMP (see appendix 4):
$$s = \sqrt{MSE} = \sqrt{0.20309} = 0.4506550787$$
$$\bar{y}_{CDS} - \bar{y}_{MFS} \pm t_{0.025} \, s \sqrt{\frac{1}{n_{CDS}} + \frac{1}{n_{MFS}}} = 5.7624265 - 5.7359535 \pm 1.96 \cdot 0.4506550787 \sqrt{\frac{1}{68} + \frac{1}{43}}$$
$$= (-0.1456239392; 0.1985699392) \approx (-0.146; 0.199)$$
Since the 95% confidence interval (-0.146; 0.199) contains zero we cannot reject that the mean of CDS
equals the mean of MFS.
MFS vs. Traditional FMCGs (TFMCG)
$$\bar{y}_{MFS} - \bar{y}_{TFMCG} \pm t_{0.025} \, s \sqrt{\frac{1}{n_{MFS}} + \frac{1}{n_{TFMCG}}} = 5.7359535 - 5.3656043 \pm 1.96 \cdot 0.4506550787 \sqrt{\frac{1}{43} + \frac{1}{278}}$$
$$= (0.225606646; 0.515091754) \approx (0.226; 0.515)$$
Seeing as the confidence interval does not contain zero, we can conclude that the two means differ, and
that the mean efficiency of the Modern Format Stores is between 0.226 and 0.515 larger than the mean efficiency
of the Traditional FMCGs.
CDS vs. Traditional FMCGs
We have already constructed a confidence interval for the difference between Consumer Durable Stores and
Traditional FMCGs, and concluded with 95% confidence that the difference between the two store types'
population means lies in the interval (0.279; 0.514). Here it is also evident that the Consumer Durable
Stores are more efficient than the Traditional FMCGs.
Since the confidence interval for the Consumer Durable Stores and the Modern Format Stores contains
zero, we cannot reject that their population means are equal, so they might be equally
efficient. However, we can conclude that the Traditional FMCGs are statistically significantly less efficient
than the Consumer Durable Stores and the Modern Format Stores.
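The two pairwise intervals based on the common standard deviation can be reproduced with a short Python sketch. The function name `pooled_pairwise_ci` is our own, and the critical value 1.96 is the large-df value used throughout this assignment.

```python
import math


def pooled_pairwise_ci(mean_i, n_i, mean_j, n_j, s, critical=1.96):
    """CI for mean_i - mean_j using the common ANOVA standard deviation s."""
    margin = critical * s * math.sqrt(1 / n_i + 1 / n_j)
    diff = mean_i - mean_j
    return diff - margin, diff + margin


s_common = math.sqrt(0.20309)  # square root of the Mean Square Error from JMP

cds_vs_mfs = pooled_pairwise_ci(5.7624265, 68, 5.7359535, 43, s_common)
mfs_vs_tfmcg = pooled_pairwise_ci(5.7359535, 43, 5.3656043, 278, s_common)

for label, (low, high) in [("CDS - MFS", cds_vs_mfs), ("MFS - TFMCG", mfs_vs_tfmcg)]:
    verdict = "contains 0" if low < 0 < high else "does not contain 0"
    print(f"{label}: ({low:.3f}; {high:.3f}) {verdict}")
```

Note that the CDS vs. Traditional FMCGs interval quoted above comes from the earlier two group comparison, which used separate standard deviations rather than the common one.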
Question 3
This question considers a comparison of the probability of computer use in relation to store type.
Two Group Comparison
Significance Test
1. Assumptions
Since this question concerns a comparison of proportions, the first assumption states that the response
variable for the two groups has to be categorical. The second assumption is concerned with the sample size,
which has to be large enough so that there are at least five successes and five failures for each of the two
groups. The third assumption states that the data must be collected by using randomization. All the
assumptions are fulfilled.
2. Hypotheses
The null hypothesis assumes that the two proportions are equal, thereby suggesting that there is no
association: $H_0: p_1 = p_2$
The two-sided alternative hypothesis assumes that the two proportions are different from one another,
thereby suggesting that there is an association between computer use and store type: $H_a: p_1 \neq p_2$
3. Test Statistic
When comparing proportions we use a z test statistic, which describes the distance between the sample
estimate and the null hypothesis value, measured in number of standard errors. We obtained the
proportion of computer use in the different store types by using SAS JMP (see appendix 5).
$$se = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
$$se = \sqrt{\frac{0.0537(1 - 0.0537)}{71} + \frac{0.0311(1 - 0.0311)}{283}} = 0.028674012 \approx 0.03$$
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{se}$$
$$z = \frac{(0.0537 - 0.0311) - 0}{0.03} = 0.789$$
4. P-Value
The P-value is the two-tail probability of a standard normal distribution, which can be obtained from Table A on page
A-2 in the book.
We obtain a cumulative probability of 0.7852. Since this is the area under the standard normal curve to
the left of z, and we are interested in the probability to the right of z, we subtract the cumulative
probability from 1. Because we need the two-tail probability of values at least as extreme as the observed z test statistic, we multiply by 2 and get the following p-value, where $\Phi(z)$ denotes the cumulative probability:
$$2 \cdot (1 - \Phi(z)) = 2 \cdot (1 - 0.7852) = 0.4296$$
5. Conclusion
We can conclude that since the P-value is larger than the significance level, we cannot reject the null
hypothesis. Thus there is not necessarily an association between computer use and store type concerning
traditional and consumer durable stores.
NOTE
We made a serious mistake when handing in the assignment, which we only noticed afterwards: we should have
used the pooled estimate, together with the proportions of computer use within each store type (19/71 = 0.2676 and 11/283 = 0.0389). Doing this gives a new z-score of 6.19 (p < 0.00001).
The calculations are:
pooled estimate: $\hat{p} = (19 + 11)/(71 + 283) = 0.084746$
$$se_0 = \sqrt{0.084746 \cdot (1 - 0.084746) \cdot \left(\frac{1}{71} + \frac{1}{283}\right)}$$
$$z = \frac{(0.2676 - 0.0389) - 0}{se_0} = 6.19$$
The P-value is extremely low, so we reject H0, which means that our conclusion is that there IS a connection
between computer use and store type.
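The corrected test in the note can be verified with a minimal Python sketch that works directly from the counts (19 of 71 and 11 of 283). The function name is our own, and the p-value is taken from the standard normal distribution via `math.erfc`.

```python
import math


def pooled_two_proportion_z(successes1, n1, successes2, n2):
    """z test for H0: p1 = p2 using the pooled proportion in the standard error."""
    p1 = successes1 / n1
    p2 = successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = ((p1 - p2) - 0) / se0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tail normal probability
    return pooled, se0, z, p_value


# Consumer Durable Stores: 19 of 71 use a computer; Traditional FMCGs: 11 of 283.
pooled_p, se_pooled, z_pooled, p_pooled = pooled_two_proportion_z(19, 71, 11, 283)

print(f"pooled = {pooled_p:.6f}, se0 = {se_pooled:.6f}, "
      f"z = {z_pooled:.2f}, p = {p_pooled:.1e}")
```

Working from the counts avoids the rounding of the proportions that makes the hand calculation vary between 6.18 and 6.19.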
Confidence Interval
We will now construct a confidence interval, by using the following formulas for finding the standard error
and the confidence interval.
$$(\hat{p}_1 - \hat{p}_2) \pm z_{0.025}(se)$$
$$se = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
$$se = \sqrt{\frac{0.0537(1 - 0.0537)}{71} + \frac{0.0311(1 - 0.0311)}{283}} = 0.028674012 \approx 0.03$$
$$(0.0537 - 0.0311) \pm 1.96 \cdot 0.03 = 0.0226 \pm 0.0588 = (-0.0362; 0.0814)$$
Since the confidence interval contains zero, it is plausible that $(p_1 - p_2)$ equals zero, which means that the
population proportions might be equal. This indicates that there is not necessarily an association between
computer use and store type for Traditional FMCGs and Consumer Durable Stores.
NOTE
Given the previous note, the confidence interval above is wrong, as we should have used the conditional proportions (row %
in JMP). The corrected interval is
(0.1232; 0.334) (NOTE: it does NOT contain 0!)
Calculations:
$$se = \sqrt{\frac{0.2676(1 - 0.2676)}{71} + \frac{0.0389(1 - 0.0389)}{283}} = 0.0537$$
The CI is then:
$$(0.2676 - 0.0389) \pm 1.96 \cdot 0.0537$$
Our CI tells us that the proportion of stores with a computer is approximately 12 to 33 percentage points higher among Consumer Durable
Stores than among Traditional FMCGs.
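The corrected interval can likewise be checked from the counts. In this sketch (function name hypothetical) the standard error uses the two sample proportions separately, as in the confidence interval formula, rather than the pooled proportion used in the test.

```python
import math


def proportion_difference_ci(successes1, n1, successes2, n2, critical=1.96):
    """CI for p1 - p2 with the unpooled standard error."""
    p1 = successes1 / n1
    p2 = successes2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - critical * se, diff + critical * se


# 19 of 71 Consumer Durable Stores and 11 of 283 Traditional FMCGs use a computer.
prop_ci_low, prop_ci_high = proportion_difference_ci(19, 71, 11, 283)

print(f"({prop_ci_low:.4f}; {prop_ci_high:.4f})")
```

The small differences from the interval quoted in the note come from rounding the proportions and the standard error in the hand calculation.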
Three Group Comparison
Chi Squared Test
This is a test of independence, which compares the observed counts with the expected counts by summing the
squared differences between the two, each divided by the expected count.
When H0 is true, thus assuming independence, the chi-squared statistic tends to be small, and if there is an
association, it tends to be large.
$$X^2 = \sum \frac{(\text{Observed Count} - \text{Expected Count})^2}{\text{Expected Count}}$$
The expected cell count can be calculated by using:
$$\text{Expected Cell Count} = \frac{(\text{Row Total}) \cdot (\text{Column Total})}{\text{Total Sample Size}}$$
SAS JMP (see appendix 6) provides the Chi Squared value (Pearson) to be 116.147. Since this is a very
large value, far above the 0.05 critical value of 5.99 for df = (3 − 1)(2 − 1) = 2, we can conclude that there is an association between the three store types and computer
use. Seeing as our two group comparison between Traditional FMCGs and Consumer Durable Stores gave
us a result where there was not necessarily an association, there must be a bigger chance of having a
computer in a Modern Format Store.
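The expected-count and chi-squared formulas can be illustrated in Python. The counts for Modern Format Stores are not given in the text, so this sketch only uses the 2x2 table for the two store types whose counts appear in the notes to Question 3. Its result is therefore not comparable to the 116.147 reported by JMP for the full three-group table. The function name is our own.

```python
def chi_squared(observed):
    """Pearson chi-squared statistic and df for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # (row total * column total) / n
            statistic += (obs - expected) ** 2 / expected

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Rows: Consumer Durable Stores, Traditional FMCGs.
# Columns: uses a computer, does not use a computer.
table_2x2 = [[19, 71 - 19],
             [11, 283 - 11]]
x2_stat, x2_df = chi_squared(table_2x2)

print(f"X^2 = {x2_stat:.2f}, df = {x2_df}")
```

For a 2x2 table the chi-squared statistic equals the square of the pooled z statistic, which offers a consistency check against the corrected two group test.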
Question 4
This question concerns a comparison between the two variables Efficiency3yr and Efficiency.
In order to compare the two variables, we construct a new column in SAS JMP called Efficiency-Efficiency3yr, which gives the change in efficiency. This will be followed by a significance test and a
confidence interval, in order to confirm whether or not there has been a significant change in efficiency.
Significance Test
1. Assumptions
Three assumptions apply to the comparison of means. The first assumption states that the variables need
to be quantitative, which is the case of the efficiency variables. The second assumption states that the data
set has to be collected using randomization. The third assumption states that the population distribution
has to be approximately normal. All the assumptions are fulfilled.
2. Hypothesis
The null hypothesis assumes that the means are equal, which indicates that there is no change in efficiency
during the three years: $H_0: \mu = \mu_0$, where $\mu$ is the population mean change and $\mu_0 = 0$.
The two-sided alternative hypothesis assumes that the means differ, thus indicating that there has been a
change in efficiency during the three years: $H_a: \mu \neq \mu_0$
3. Test Statistic
The test statistic uses the standard error to measure the distance between the sample mean $\bar{x}$ (see
appendix 7) and the value of the null hypothesis $\mu_0$.
The following formula can be used to calculate the t test statistic:
$$t = \frac{\bar{x} - \mu_0}{se} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{0.066003 - 0}{0.3096432/\sqrt{338}} = 3.918866589 \approx 3.92$$
4. P-Value
The p-value is the two-tail probability of getting values at least as extreme as the observed t test statistic. We can obtain
the right-tail probability of the t test statistic by using table B on page A-3. As df = n − 1 = 337 is larger than 100,
we use the last row of the table (df = infinity). With a t test statistic of
3.92, the right-tail probability is smaller than 0.001. The P-value must therefore be
smaller than 0.002, so we can reject H0 as the P-value is smaller than the significance level of α=0.05.
Furthermore, SAS JMP provides us with the p-value of 0.0001, which is far smaller than the
significance level.
5. Conclusion
Since the p-value is smaller than the significance level, it is evident that H0 must be rejected. This means
that there has been a significant increase in efficiency during the three years.
Confidence Interval
After rejecting H0, we construct a confidence interval in order to confirm within which interval the
population mean change in efficiency, through the past three years, lies.
To construct a 95% confidence interval, we use the following formula and the values given to us by SAS JMP
(see appendix 7):
$$\bar{x} \pm t_{0.025}(se) \quad \text{where} \quad se = \frac{s}{\sqrt{n}}$$
The sample mean that will be investigated is found to be 0.066003 by using SAS JMP (see appendix 7). The
confidence interval will now be calculated.
$$se = \frac{0.3096432}{\sqrt{338}} = 0.016842369728331$$
$$0.066003 \pm 1.96 \cdot 0.016842369728331 = (0.0329919553; 0.0990140447) \approx (0.03299; 0.099)$$
This is approximately the same as the confidence interval given by SAS JMP of (0.03287; 0.09913).
Seeing as the confidence interval does not contain zero, we can conclude that there has been a positive
change in efficiency during the last three years. Furthermore we can say with 95% confidence that the
average increase in efficiency lies within the interval of (0.03299; 0.099).
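The test and interval for the mean change can be recomputed from the three summary values (mean, standard deviation and n of the difference column). This is a small sketch with a hypothetical function name, using 1.96 as the critical value as in the text.

```python
import math


def mean_change_summary(mean_diff, sd_diff, n, mu0=0.0, critical=1.96):
    """Standard error, t statistic and CI for the mean of paired differences."""
    se = sd_diff / math.sqrt(n)
    t = (mean_diff - mu0) / se
    ci = (mean_diff - critical * se, mean_diff + critical * se)
    return se, t, ci


# Summary statistics of Efficiency minus Efficiency3yr from JMP (appendix 7).
se_change, t_change, ci_change = mean_change_summary(0.066003, 0.3096432, 338)

print(f"se = {se_change:.6f}, t = {t_change:.2f}, "
      f"CI = ({ci_change[0]:.5f}; {ci_change[1]:.5f})")
```

JMP's slightly wider interval most likely reflects an exact t critical value for df = 337 instead of 1.96.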
Question 5
This question regards fitting a simple linear regression with efficiency as the response variable and logSize
as the explanatory variable. We will start out by discussing whether the assumptions of the model are
violated, and then compute a 95% confidence interval.
Model Assumptions
The first assumption states that the population means of y follow the regression line $\mu_y = \alpha + \beta x$. The fit from SAS JMP (see appendix 8) gives the estimated regression line $\hat{y} = 4.7291198 + 0.3390093x$. This
regression has an R² value of 0.114002, so logSize explains about 11% of the variability in efficiency. The linear association between efficiency and logSize is therefore
not very strong, since R² would be close to 1 for a strong association.
The second assumption concerns randomization, which is fulfilled as stated in question 2. The third assumption states
that the population y values at each x value have a normal distribution with the same standard deviation at
each x value. The data set approximately fulfills the model assumptions.
Further violations of the model could include the sample size, but this is not relevant for this data set.
Confidence Interval
We construct a confidence interval for the slope to determine whether 0 is part of the interval, and thereby whether
x and y could be statistically independent. A 95% confidence interval for the slope has the formula:
$$b \pm t_{0.025}(se) = 0.3390092 \pm 1.96 \cdot 0.048042 = (0.244847, 0.433172)$$
The standard error is supplied by SAS JMP (see appendix 8).
The t-score $t_{0.025}$ has df = n − 2 = 389 − 2 = 387, thus $t_{0.025}$ = 1.96 (Table B from page A-3).
Since the 95% confidence interval does not contain 0, we can infer that H0: β = 0 would be rejected in a significance
test. As the slope is not equal to 0, the variables are linearly associated, meaning that they are dependent
on each other. We infer that mean efficiency increases by at least 0.244847
and at most 0.433172 for an increase of 1 in logSize, which corresponds to a tenfold increase in store size.
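A short Python sketch can illustrate both the slope interval and what the fitted line implies for two stores that differ by one unit of logSize. The function names are our own, and the two logSize values (2 and 3, i.e. 100 and 1,000 sq.ft.) are chosen purely as an example.

```python
def slope_ci(b, se, critical=1.96):
    """Confidence interval b +/- critical * se for a regression slope."""
    return b - critical * se, b + critical * se


def predict_efficiency(log_size, intercept=4.7291198, slope=0.3390093):
    """Predicted efficiency from the fitted line reported by JMP (appendix 8)."""
    return intercept + slope * log_size


slope_low, slope_high = slope_ci(0.3390092, 0.048042)

# Example stores: logSize 2 (100 sq.ft.) and logSize 3 (1,000 sq.ft.).
pred_small = predict_efficiency(2.0)
pred_large = predict_efficiency(3.0)

print(f"slope CI: ({slope_low:.6f}, {slope_high:.6f})")
print(f"predicted efficiency: {pred_small:.4f} vs {pred_large:.4f}")
```

The difference between the two predictions equals the slope, which is exactly the quantity the confidence interval is about.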
Question 6
a) Additive Model
We will now construct an additive model and explain the most important parts of the output.
The bivariate regression can be extended to a multiple regression equation.
$$\mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
For practical purposes we use SAS JMP to calculate the multiple regression model (see appendix 9). We can
explain the most important parts of the output by looking at the prediction expression. The interpretation
of the estimates of β depends on whether the term is categorical or quantitative. If it is quantitative, the
estimate of β describes how much a one-unit increase in the term adds to or subtracts from the predicted efficiency, holding the other terms fixed.
If it is categorical, only the estimate related to the observation's category is used, e.g. if the store type is a
consumer durable store, 0.33 will be added to the efficiency, whereas the estimate for modern format store
will not be a part of the equation.
R2
Furthermore we can look at R², which SAS JMP has calculated to be 0.26. This indicates that
predictions from the multiple regression equation have 26% less error than predictions using the sample mean $\bar{y}$, which was found to equal
5.476. Additionally, we could look at how much R² increases as variables are included. It
cannot get smaller when a variable is added; it can only stay the same or increase.
b) Fulfillment of Model Assumptions
Model Assumptions:
The first assumption is that each explanatory variable has a straight-line relation with μy, with the
same slope for all combinations of values of the other predictors in the model. The second assumption states
that the data must be gathered using randomization. The third assumption states that y must have a
normal distribution with the same standard deviation at each combination of values of the predictors in
the model.
The model assumptions are fairly satisfied, since the standard errors are somewhat equal, with the biggest
difference of about 0.1 standard errors. Furthermore the sample is collected using randomization and the
distributions are approximately normal.
Nonlinear effects
In order to test whether there are any nonlinear effects we use SAS JMP (see appendix 10) to see if any of
the terms have a polynomial tendency. From the data we can see that logSize has a polynomial tendency
since it has a p-value lower than the significance level. On the other hand competition has a linear tendency
since the p-value is a lot higher than the significance level. In regards to the nonlinear tendencies of the
logSize variable, this might lead to violations of the model assumptions, but it cannot be concluded with
certainty that the assumption has been violated.
Interactions
Interactions between two explanatory variables occur when there is a change in the slope of the
relationship between μy and one of the explanatory variables, as the other explanatory variable changes.
There is an interaction when the change in one variable affects the outcome of the other variable. Thus we
can test for interactions in SAS JMP (see appendix 11). If the obtained probability of the crossed estimates
is below the significance level of 0.05 there is an interaction. Thus there is an interaction between logSize
and storetype as the value of 0.0002 is below the significance level.
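As there is an interaction only between logSize and storetype, the additive model does not fully describe the data: the effect of logSize on efficiency depends on the store type. To make the model more accurate, we would have to account for this, for example by including the interaction term between logSize and storetype instead of assuming a common slope.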
c) Statistical Significance of Predictors
We could have added the parameters into the model stepwise, to see how much each parameter increases
R² and thereby enhances the predictive power of the model. A clear increase in R² when a parameter is added suggests that the parameter contributes to the model, whereas a negligible increase suggests that it adds little. We can obtain R² by using the following
formula:
$$R^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$$
But we can also look at the effect test in JMP. We set a significance level of 0.05, and conclude from the
effect test table that the city parameter is not statistically significant since 0.5197>0.05. In contrast, the
parameters of storetype, logSize and competition are all statistically significant, since their p-values are below 0.05.
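The R² formula used in this section can be sketched in Python as follows. The observed and predicted values below are made-up numbers chosen only to show the calculation; they are not taken from the store data, and the function name is our own.

```python
def r_squared(y, y_hat):
    """R^2 = (sum (y - y_bar)^2 - sum (y - y_hat)^2) / sum (y - y_bar)^2."""
    y_bar = sum(y) / len(y)
    total_ss = sum((yi - y_bar) ** 2 for yi in y)                  # error using the mean
    residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error using the model
    return (total_ss - residual_ss) / total_ss


# Hypothetical observed and predicted efficiency values (illustration only).
y_obs = [5.0, 5.5, 6.0, 6.5]
y_pred = [5.1, 5.4, 6.1, 6.4]

r2_example = r_squared(y_obs, y_pred)
r2_mean_only = r_squared(y_obs, [sum(y_obs) / len(y_obs)] * len(y_obs))

print(r2_example, r2_mean_only)
```

Predicting every observation with the sample mean gives R² = 0, which is the baseline the 26% error reduction in part a) is measured against.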
Confidence interval:
The 95% confidence interval of the effect of competition is given by the formula below:
$$\text{Estimated slope} \pm t_{0.025}(se)$$
The t-score has df = n − number of parameters in the regression equation, so df = 389 − 9 = 380.
Thus the t-score is equal to 1.96, seen from Table B on page A-3 in the book. We obtain the estimated slope
from the “indicator parameterization” in JMP and conclude that b=0.5250564.
Thus the 95% confidence interval for the effect of competition is (0.2334871,
0.8166256). So with 95% confidence, the effect on mean efficiency of increasing competition by 1 unit, holding the other predictors fixed, is
at least 0.2334871 and at most 0.8166256.
Question 7
a) Logistic Regression Model
We have fitted a logistic regression model in SAS JMP in order to predict the computer usage based on
logSize (see appendix 12). Thus we obtain the below estimates of the model.
Odds ratio:
As seen from the unit odds ratio from SAS JMP in the above graph, the probability of having a computer
increases as the store size increases. This is consistent with the odds ratio of 0.035066, which is below 1, provided that the odds are those of not using a computer (the response level modelled in the output): those odds are multiplied by 0.035066 for each one-unit increase in logSize.
The odds ratio of 0.035066 corresponds to a tenfold increase in size, seeing as size is described on a
base-10 log scale.
Confidence Interval
From SAS JMP (see the above graph) we can see that the confidence interval for the odds ratio is
(0.014086; 0.07909). This means that we can be 95% confident that the population odds ratio falls between
0.014086 and 0.07909. Since the interval does not contain 1, the effect of logSize is statistically significant.
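An odds ratio below 1 for one outcome is equivalent to an odds ratio above 1 for the opposite outcome, because the two are reciprocals. The sketch below illustrates this with the reported values. It assumes, as a reading of the output that is not confirmed in the text, that the reported odds ratio refers to the odds of not using a computer; the function name is our own.

```python
import math


def invert_odds_ratio(odds_ratio, ci_low, ci_high):
    """Odds ratio and CI for the opposite outcome (reciprocals, bounds swapped)."""
    return 1 / odds_ratio, 1 / ci_high, 1 / ci_low


# Unit odds ratio and 95% CI reported by JMP for logSize.
reported_or, reported_low, reported_high = 0.035066, 0.014086, 0.07909

# The logistic regression coefficient is the natural log of the unit odds ratio.
log_odds_coefficient = math.log(reported_or)

inv_or, inv_low, inv_high = invert_odds_ratio(reported_or, reported_low, reported_high)

print(f"coefficient = {log_odds_coefficient:.4f}")
print(f"opposite-outcome odds ratio = {inv_or:.2f} ({inv_low:.2f}; {inv_high:.2f})")
```

Under that assumption, a tenfold increase in store size multiplies the odds of using a computer by roughly 28.5.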
b) Multiple Regression Model
We will now use SAS JMP to construct a logistic regression with multiple explanatory variables. The graph
below shows the estimates.
Thus the logistic model with the predictors: logSize, Perception, City and StoreType, is given by SAS JMP as
the above parameter estimate.
c) Significance: Perception
We will describe the effect of perception statistically and in the real world. From the below table, it can be
seen that Perception is statistically significant, as its p-value is below the significance level (0.0412<0.05).
Just because the effect is statistically significant, it does not mean that there is a practically important connection between
perception and computer use. Thus it is important to distinguish between statistical significance and real
world significance, as e.g. it does not necessarily make sense that computer use is determined by how
people perceive a store.
Appendix
Appendix 1 – Question 1
Appendix 2 – Question 2
Appendix 3 – Question 2
Appendix 4 – Question 2
Appendix 5 – Question 3
Appendix 6 – Question 3
Appendix 7 – Question 4
Appendix 8 – Question 5
Appendix 9 – Question 6
Appendix 10 – Question 6B
Appendix 11 – Question 6B
Appendix 12 – Question 6B
Appendix 13 – Question 7