Assignment in Statistics and Research Methods
Indian Retail Stores
Winter 2013/2014

Kirsten Marie Simonsen – XXXXXX-XXXX
Kia Slæbæk Jensen – XXXXXX-XXXX
Jasmin Sharzad – XXXXXX-XXXX
Malene Louise Thomasen – XXXXXX-XXXX

CTU: 25,230
BSc IBP Statistics and Research Methods Winter 13/14

Table of Contents
Question 1
Question 2
   Two Group Comparison
   Three Group Comparison
Question 3
   Two Group Comparison
   Three Group Comparison
Question 4
Question 5
Question 6
   a) Additive Model
   b) Fulfillment of Model Assumptions
   c) Statistical Significance of Predictors
Question 7
   a) Logistic Regression Model
   b) Multiple Regression Model
   c) Significance: Perception

Question 1

This question concerns a description of the variables in the data set. Furthermore, the minimum and maximum store size will be calculated in both square feet and square meters.

Variables are the characteristics observed in a study. The variables in the data file from the research papers “Competition and labor productivity in India’s retail stores” and “Are labor regulations driving computer usage in India’s retail stores?” are LogSize, Competition, Perception, Efficiency3yr, Efficiency, Logsales3yr, Logsales, Store type, City and Computer use.

A variable can be either quantitative or categorical. The quantitative variables in the data set are LogSize, Competition, Perception, Efficiency3yr, Efficiency, Logsales3yr and Logsales, since they can take any value within a certain interval.
Furthermore, these variables are continuous. LogSize, Logsales3yr and Logsales have been converted to base-10 logarithms in order to reduce the spread of the data. The categorical variables in the data set are Store type, City and Computer use, since each observation belongs to exactly one of a set of categories.

Graphs and numerical summaries describe the main features of quantitative and categorical variables. For quantitative variables, the key features to describe are the center and the variability, which is why a histogram is usually used to describe the data. In contrast, for categorical variables a key feature is the relative number of observations in the various categories, so a bar graph is commonly used.

Minimum and maximum

We will now describe the minimum and maximum of the logSize variable. The store size is first calculated in square feet and then converted into square meters. By constructing a histogram in SAS JMP (see appendix 1), we get the maximum and minimum values of the logSize variable. Since the data are given as base-10 logarithms, $x = \log_{10} y$, we use $y = 10^{x}$ to obtain the minimum and maximum values in square feet:

Min: $10^{1.079} = 11.995$ sq.ft.
Max: $10^{4.176} = 14996.848$ sq.ft.

These values can be converted into square meters by using the following formula:

$m^2 = \frac{ft^2}{10.764}$

Inserting the above results into the formula gives:

Min: $\frac{11.995}{10.764} = 1.114$ sq.m
Max: $\frac{14996.848}{10.764} = 1393.241$ sq.m

Thus it can be concluded that there is a wide spread in the LogSize variable, meaning that there is a large difference in the store sizes.

Question 2

This question regards a comparison of the store types in terms of their level of efficiency.
We will begin by constructing a significance test and a confidence interval for the two group comparison, followed by a significance test and a confidence interval for the three group comparison.

Two Group Comparison

Significance Test

1. Assumptions
Since we are comparing means, the first assumption of the significance test is that the response variable is quantitative, which is the case for the efficiency variable. The second assumption is that the sample is collected using randomization. The sampling methodology for Enterprise Surveys is stratified random sampling: all population units are grouped into homogeneous groups, and simple random samples are selected within each group. The third assumption states that each group has an approximately normal distribution. The histograms for each group (see appendix 2) show that this is the case for both store types.

2. Hypotheses
The null hypothesis assumes that the two means are equal, so that there is no difference in the level of efficiency:
$H_0: \mu_1 = \mu_2$
The two-sided alternative hypothesis states that the means differ from one another, implying an association between efficiency and store type:
$H_a: \mu_1 \neq \mu_2$

3. Test Statistic
The test statistic measures the distance between the point estimate and the null hypothesis value, expressed as a number of standard errors. The test statistic has approximately a t distribution if $H_0$ is true.
The following formula is used to construct the test statistic:
$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{se}$
The standard error can be calculated by using the following formula:
$se = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
The standard error is then:
$se = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{0.450465^2}{68} + \frac{0.45865^2}{278}} = 0.061161988 \approx 0.06$
And the t test statistic is:
$t = \frac{(5.7624265 - 5.3656043) - 0}{0.061161988} = 6.488052677 \approx 6.488$

4. P-Value
The p-value is the probability that the test statistic takes the observed value or a more extreme value if $H_0$ is true. In this case the two-tail probability from the t distribution is used, and it has to be smaller than the significance level of α = 0.05 if $H_0$ is to be rejected. We use table B on page A-3. Since df is larger than 100, we read the bottom (infinity) row of the table: with a t test statistic of 6.488, the right-tail probability is far smaller than 0.001. The two-sided P-value must therefore be smaller than 0.002.

5. Conclusion
Seeing as the p-value is smaller than the significance level of 0.05, we can reject $H_0$. Therefore we can conclude that there is a difference in efficiency between the two store types.

Confidence Interval

Since the response variable is quantitative, we are comparing means. Using SAS JMP (see appendix 2) to obtain the means, we have constructed a confidence interval for the difference in mean efficiency between the store types Traditional and Consumer Durables. The interval is calculated with the following formula:
$(\bar{x}_1 - \bar{x}_2) \pm t_{0.025} \cdot se$
The t-score $t_{0.025}$ is read with df = n - 1 = (68 + 278) - 1 = 345. (The Welch approximation that software typically uses with an unpooled standard error gives roughly 104 here, which is also above 100.) Since the degrees of freedom exceed 100, we use $t_{0.025} = 1.96$.
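The two-sample calculation above can be checked numerically. The following is a minimal sketch, not the authors' JMP workflow: it takes the summary statistics quoted in the text and, as an assumption, uses the standard normal distribution in place of the t distribution because the degrees of freedom are large. All variable names are illustrative.

```python
import math
from statistics import NormalDist

# Summary statistics quoted in the text
# (group 1: Consumer Durable Stores, group 2: Traditional FMCGs).
mean_1, sd_1, n_1 = 5.7624265, 0.450465, 68
mean_2, sd_2, n_2 = 5.3656043, 0.45865, 278

# Unpooled standard error of the difference in means.
se = math.sqrt(sd_1**2 / n_1 + sd_2**2 / n_2)

# Test statistic for H0: mu1 - mu2 = 0.
diff = mean_1 - mean_2
t_stat = (diff - 0) / se

# Assumption: df is well above 100, so the normal distribution stands in for t.
z_crit = NormalDist().inv_cdf(0.975)
p_two_sided = 2 * (1 - NormalDist().cdf(abs(t_stat)))

# 95% confidence interval with the unrounded standard error.
ci = (diff - z_crit * se, diff + z_crit * se)

print(f"se = {se:.6f}, t = {t_stat:.3f}, p = {p_two_sided:.2e}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because this sketch keeps the standard error unrounded, its interval is slightly wider than one computed with se rounded to 0.06.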
The standard error (se) is calculated to be:
$se = \sqrt{[se(\bar{x}_1)]^2 + [se(\bar{x}_2)]^2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{0.450465^2}{68} + \frac{0.45865^2}{278}} = 0.061161988 \approx 0.06$
The confidence interval is then:
$(5.7624265 - 5.3656043) \pm 1.96 \cdot 0.06 = 0.3968222 \pm 0.1176 = (0.2792222, 0.5144222) \approx (0.279, 0.514)$
Thus we can be 95% confident that the population mean difference in efficiency ($\mu_1 - \mu_2$) between Consumer Durable Stores and Traditional FMCGs falls between 0.279 and 0.514. We can then infer that the efficiency in the Consumer Durable Stores is between 0.279 and 0.514 points larger than the efficiency in the Traditional FMCGs, measured on a scale from 1 to 4.

Three Group Comparison

Significance Test

When comparing several means, the analysis of variance method is used by constructing a one-way ANOVA test (see appendix 3). It tests whether efficiency is independent of store type across the three store types.

1. Assumptions
The first assumption states that the distributions for the groups are normal with the same standard deviation in each group. The standard deviations are not completely identical: the largest and the smallest differ by 0.064. However, since the largest standard deviation is less than twice the smallest, the general formula can be used for the calculation. The second assumption is randomization, which is fulfilled (see the two group comparison).

2. Hypotheses
The null hypothesis states that the means for each group are equal, so that there is no difference in the level of efficiency between the store types:
$H_0: \mu_1 = \mu_2 = \dots = \mu_g$
The alternative hypothesis states that at least two of the means are unequal, suggesting an association between efficiency and store type.

3. Test Statistic
The ANOVA test statistic has an F distribution with two degrees of freedom values, and its mean is approximately 1 when $H_0$ is true.
F test statistic:
$F = \frac{\text{Between-groups variability}}{\text{Within-groups variability}} = \frac{5.93626}{0.20309} = 29.2303 \approx 29.23$

4. P-Value
The p-value is the right-tail probability from the F distribution with
$df_1 = g - 1 = 3 - 1 = 2$
$df_2 = n - g = 388 - 3 = 385$
Using Table D on page A-5 we find that if the F test statistic is 3 or above, the right-tail probability is 0.05 or smaller. Since the F test statistic of 29.23 is far larger than 3, the P-value must be much smaller than the significance level of 0.05.

5. Conclusion
JMP reports the P-value to be smaller than 0.001, so we can reject the null hypothesis. Thus it can be concluded that at least two of the groups have different means.

Confidence Interval

The confidence interval is used to estimate the differences between population means. The following formula is used to construct a 95% confidence interval for the difference between two group means:
$\bar{y}_i - \bar{y}_j \pm t_{0.025} \, s \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$
The t-score has df = N - g. We will use the formula to obtain the confidence intervals for the differences in means between the three store types.

Consumer Durable Stores (CDS) vs. Modern Format Stores (MFS)
The common standard deviation is obtained from the mean square error reported by JMP (see appendix 4):
$s = \sqrt{MSE} = \sqrt{0.20309} = 0.4506550787$
$\bar{y}_{CDS} - \bar{y}_{MFS} \pm t_{0.025} \, s \sqrt{\frac{1}{n_{CDS}} + \frac{1}{n_{MFS}}} = 5.7624265 - 5.7359535 \pm 1.96 \cdot 0.4506550787 \sqrt{\frac{1}{68} + \frac{1}{43}} = (-0.1456239392, 0.1985699392) \approx (-0.146, 0.199)$
Since the 95% confidence interval (-0.146, 0.199) contains zero, we cannot reject that the mean of CDS equals the mean of MFS.

MFS vs. Traditional FMCGs (TFMCG)
$\bar{y}_{MFS} - \bar{y}_{TFMCG} \pm t_{0.025} \, s \sqrt{\frac{1}{n_{MFS}} + \frac{1}{n_{TFMCG}}} = 5.7359535 - 5.3656043 \pm 1.96 \cdot 0.4506550787 \sqrt{\frac{1}{43} + \frac{1}{278}} = (0.225606646, 0.515091754) \approx (0.226, 0.515)$
Seeing as the confidence interval does not contain zero, we can conclude that the two means differ, and that the efficiency of the Modern Format Stores is between 0.226 and 0.515 larger than the efficiency of the Traditional FMCGs.

CDS vs. Traditional FMCGs
We have already constructed a confidence interval for the difference between Consumer Durable Stores and Traditional FMCGs, and concluded with 95% confidence that the difference between the population means of the two store types lies in the interval (0.279, 0.514). Here it is also evident that the Consumer Durable Stores are more efficient than the Traditional FMCGs.

Since the confidence interval for the Consumer Durable Stores and the Modern Format Stores contains zero, we cannot reject that their population means are equal, so they might be equally efficient. However, we can conclude that the Traditional FMCGs are statistically significantly less efficient than both the Consumer Durable Stores and the Modern Format Stores.

Question 3

This question considers a comparison of the probability of computer use in relation to store type.

Two Group Comparison

Significance Test

1. Assumptions
Since this question concerns a comparison of proportions, the first assumption states that the response variable for the two groups has to be categorical. The second assumption concerns the sample size, which has to be large enough that there are at least five successes and five failures in each of the two groups. The third assumption states that the data must be collected by using randomization. All the assumptions are fulfilled.

2. Hypotheses
The null hypothesis assumes that the two proportions are equal, thereby suggesting that there is no association:
$H_0: p_1 = p_2$
The two-sided alternative hypothesis assumes that the two proportions differ from one another, thereby suggesting that there is an association between computer use and store type:
$H_a: p_1 \neq p_2$

3. Test Statistic
When comparing proportions we use a z statistic, which describes the distance between the sample estimate and the null hypothesis value, measured in number of standard errors. We obtained the proportion of computer use in the different store types by using SAS JMP (see appendix 5).
$se = \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}$
$se = \sqrt{\frac{0.0537 (1 - 0.0537)}{71} + \frac{0.0311 (1 - 0.0311)}{283}} = 0.028674012 \approx 0.03$
$z = \frac{\hat{p}_1 - \hat{p}_2 - 0}{se}$
$z = \frac{0.0537 - 0.0311 - 0}{0.028674} \approx 0.79$

4. P-Value
The P-value is the two-tail probability of a normal distribution, which can be obtained from Table A on page A-2 in the book. We obtain a cumulative probability of 0.7852. This is the area under the standard normal curve to the left of z, and since we are interested in the probability to the right of z, we subtract the cumulative probability from 1. Because we need the two-tail probability of values even more extreme than the observed z test statistic, we multiply by 2 and get the following p-value:
$2 \cdot (1 - \Phi(z)) = 2 \cdot (1 - 0.7852) = 0.4296$

5. Conclusion
Since the P-value is larger than the significance level, we cannot reject the null hypothesis. Thus there is not necessarily an association between computer use and store type for Traditional FMCGs and Consumer Durable Stores.

NOTE
We made a serious mistake when handing in the assignment, which we only saw afterwards. We should have used the pooled estimate!
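A minimal sketch of the pooled two-proportion z test that this note refers to, using the counts that appear in the note's own calculation (19 of 71 Consumer Durable Stores and 11 of 283 Traditional FMCGs with a computer). The variable names are illustrative, and the p-value is taken from the standard normal distribution.

```python
import math
from statistics import NormalDist

# Counts of stores with a computer, as quoted in the note's calculation.
x_1, n_1 = 19, 71    # Consumer Durable Stores
x_2, n_2 = 11, 283   # Traditional FMCGs

# Conditional (row) proportions for each store type.
p_1 = x_1 / n_1
p_2 = x_2 / n_2

# Pooled proportion under H0: p1 = p2.
p_pooled = (x_1 + x_2) / (n_1 + n_2)

# Standard error under H0 uses the pooled proportion for both groups.
se_0 = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_1 + 1 / n_2))

# z statistic and two-sided p-value.
z = (p_1 - p_2 - 0) / se_0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"pooled = {p_pooled:.6f}, se0 = {se_0:.5f}, z = {z:.2f}, p = {p_value:.1e}")
```

The key design point is that the standard error under the null hypothesis is built from a single pooled proportion, whereas the confidence interval uses the two separate sample proportions.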
With the pooled estimate we get a new z-score of 6.19 (p < 0.00001). The calculations are:
$\hat{p} = \frac{19 + 11}{71 + 283} = 0.084746$
$se_0 = \sqrt{0.084746 \, (1 - 0.084746) \left(\frac{1}{71} + \frac{1}{283}\right)} \approx 0.03697$
$z = \frac{0.2676 - 0.0389 - 0}{0.03697} \approx 6.19$
The P-value is extremely low, so we reject $H_0$. Our conclusion is therefore that there IS a connection between computer use and store type.

Confidence Interval

We will now construct a confidence interval, using the following formulas for the standard error and the confidence interval:
$(\hat{p}_1 - \hat{p}_2) \pm z_{0.025} (se)$
$se = \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}$
$se = \sqrt{\frac{0.0537 (1 - 0.0537)}{71} + \frac{0.0311 (1 - 0.0311)}{283}} = 0.028674012 \approx 0.03$
$(0.0537 - 0.0311) \pm 1.96 \cdot 0.03 = 0.0226 \pm 0.0588 = (-0.0362, 0.0814)$
Since the confidence interval contains zero, it is plausible that $(p_1 - p_2)$ equals zero, which means that the population proportions might be equal. This indicates that there is not necessarily an association between computer use and store type for Traditional FMCGs and Consumer Durable Stores.

NOTE
Given the previous note, the confidence interval above is wrong, as we should have used the conditional proportions (row % in JMP). The corrected interval is approximately (0.123, 0.334). (NOTE: it does NOT contain 0!)
Calculations:
$se = \sqrt{\frac{0.2676 (1 - 0.2676)}{71} + \frac{0.0389 (1 - 0.0389)}{283}} \approx 0.0537$
The CI is then:
$(0.2676 - 0.0389) \pm 1.96 \cdot 0.0537$
Our CI tells us that the probability of having a computer is approximately 12 to 33 percentage points higher in a Consumer Durable Store than in a Traditional FMCG.

Three Group Comparison

Chi Squared Test

This is a test of independence, which compares the observed counts with the expected counts by taking the difference between the two in each cell and summing the squared differences relative to the expected counts. When $H_0$ is true, thus assuming independence, the chi-squared statistic tends to be small; if there is an association, the chi-squared statistic tends to be large.
$X^2 = \sum \frac{(\text{Observed count} - \text{Expected count})^2}{\text{Expected count}}$
The expected cell count can be calculated by using:
$\text{Expected cell count} = \frac{(\text{Row total}) \cdot (\text{Column total})}{\text{Total sample size}}$
SAS JMP (see appendix 6) gives a Pearson chi-squared value of 116.147. With df = (3 - 1)(2 - 1) = 2, this is far above the 0.05 critical value of 5.99, so we can conclude that there is an association between the three store types and computer use. Seeing as our two group comparison between Traditional FMCGs and Consumer Durable Stores gave a result where there was not necessarily an association, there must be a bigger chance of having a computer in a Modern Format Store.

Question 4

This question concerns a comparison between the two variables Efficiency3yr and Efficiency. In order to compare the two variables, we construct a new column in SAS JMP called Efficiency-Efficiency3yr, which gives the change in efficiency. This is followed by a significance test and a confidence interval, in order to determine whether or not there has been a significant change in efficiency.

Significance Test

1. Assumptions
Three assumptions apply to the comparison of means. The first assumption states that the variables need to be quantitative, which is the case for the efficiency variables. The second assumption states that the data set has to be collected using randomization. The third assumption states that the population distribution has to be approximately normal. All the assumptions are fulfilled.

2. Hypotheses
The null hypothesis assumes that the mean change equals $\mu_0 = 0$, which indicates that there is no change in efficiency during the three years:
$H_0: \mu = \mu_0$
The two-sided alternative hypothesis assumes that the mean change differs from zero, indicating that there has been a change in efficiency during the three years:
$H_a: \mu \neq \mu_0$

3. Test Statistic
The test statistic uses the standard error to measure the distance between the sample mean $\bar{x}$ (see appendix 7) and the null hypothesis value $\mu_0$. The following formula is used to calculate the t test statistic:
$t = \frac{\bar{x} - \mu_0}{se} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{0.066003 - 0}{0.3096432 / \sqrt{338}} = 3.918866589 \approx 3.92$

4. P-Value
The p-value is the two-tail probability of getting values more extreme than the observed t test statistic. We can obtain the right-tail probability of the t test statistic from table B on page A-3. As df = n - 1 is larger than 100, we read the bottom (infinity) row of the table: with a t test statistic of 3.92, the right-tail probability is smaller than 0.001. The P-value must therefore be smaller than 0.002, so we can reject $H_0$, as the P-value is smaller than the significance level of α = 0.05. Furthermore, SAS JMP reports a p-value of 0.0001, which is far smaller than the significance level.

5. Conclusion
Since the p-value is smaller than the significance level, $H_0$ must be rejected. This means that there has been a significant increase in efficiency during the three years.

Confidence Interval

After rejecting $H_0$, we construct a confidence interval to determine the interval within which the population mean change in efficiency over the past three years lies. To construct a 95% confidence interval, we use the following formula and the values given by SAS JMP (see appendix 7):
$\bar{x} \pm t_{0.025} (se)$ where $se = \frac{s}{\sqrt{n}}$
The sample mean to be investigated is 0.066003 according to SAS JMP (see appendix 7). The confidence interval will now be calculated.
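The test statistic and the interval formula stated above can be reproduced with a short sketch. It uses only the summary values quoted from the JMP output (n = 338, mean change 0.066003, standard deviation 0.3096432) and, as an assumption, the normal quantile in place of the t quantile since df is large; variable names are illustrative.

```python
import math
from statistics import NormalDist

# Summary of the change in efficiency, as quoted from the JMP output.
n = 338
mean_change = 0.066003
sd_change = 0.3096432

# Standard error of the mean change.
se = sd_change / math.sqrt(n)

# One-sample t statistic for H0: mean change = 0.
t_stat = (mean_change - 0) / se

# Assumption: df = n - 1 = 337 is large, so the normal quantile (about 1.96)
# is used in place of the exact t quantile.
z_crit = NormalDist().inv_cdf(0.975)
ci = (mean_change - z_crit * se, mean_change + z_crit * se)

print(f"se = {se:.6f}, t = {t_stat:.3f}")
print(f"95% CI = ({ci[0]:.5f}, {ci[1]:.5f})")
```

JMP uses the exact t quantile for df = 337, so its interval is marginally wider than the one this sketch produces.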
$se = \frac{0.3096432}{\sqrt{338}} = 0.016842369728331$
$0.066003 \pm 1.96 \cdot 0.016842369728331 = (0.0329919553, 0.0990140447) \approx (0.03299, 0.099)$
This is approximately the same as the confidence interval given by SAS JMP of (0.03287, 0.09913). Seeing as the confidence interval does not contain zero, we can conclude that there has been a positive change in efficiency during the last three years. Furthermore, we can say with 95% confidence that the average increase in efficiency lies within the interval (0.03299, 0.099).

Question 5

This question regards fitting a simple linear regression with efficiency as the response variable and logSize as the explanatory variable. We will start by discussing whether the assumptions of the model are violated, and then compute a 95% confidence interval.

Model Assumptions

The first assumption states that the population means satisfy the regression line $\mu_y = \alpha + \beta x$. The output from SAS JMP (see appendix 8) gives the fitted regression line $\hat{y} = 4.7291198 + 0.3390093x$. This regression has an $R^2$ value of 0.114002, meaning that logSize explains about 11% of the variability in efficiency. The linear association between efficiency and logSize is therefore not very strong ($r = \sqrt{0.114} \approx 0.34$); the closer $R^2$ is to 1, the stronger the association. The second assumption concerns randomization, which is fulfilled as stated in Question 2. The third assumption states that the population y values at each x value have a normal distribution with the same standard deviation at each x value. The data set approximately fulfills the model assumptions. Further violations of the model could concern the sample size, but this is not relevant for this data set.

Confidence Interval

We construct a confidence interval to determine whether 0 is part of the interval, and thereby whether x and y are statistically independent.
A 95% confidence interval for the slope has the formula:
$b \pm t_{0.025} (se) = 0.3390092 \pm 1.96 \cdot 0.048042 = (0.244847, 0.433172)$
The standard error is supplied by SAS JMP (see appendix 8). The t-score has df = n - 2 = 389 - 2 = 387, thus $t_{0.025} = 1.96$ (Table B on page A-3).

Since the 95% confidence interval does not contain 0, we can infer that $H_0$ would be rejected in a significance test. As the slope is not equal to 0, the variables are linearly associated, meaning that they are dependent on each other. We infer that mean efficiency increases by at least 0.244847 and at most 0.433172 for an increase of 1 in logSize, which corresponds to a tenfold increase in store size.

Question 6

a) Additive Model

We will now construct an additive model and explain the most important parts of the output. The bivariate regression can be extended to a multiple regression equation:
$\mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
For practical reasons we use SAS JMP to calculate the multiple regression model (see appendix 9).

We can explain the most important parts of the output by looking at the prediction expression. The interpretation of the estimates of β depends on whether the term is categorical or quantitative. If it is quantitative, the estimate of β describes what a one-unit increase in the term adds to or subtracts from the predicted efficiency. If it is categorical, only the estimate related to the relevant category is used; e.g. if the store type is a consumer durable store, 0.33 is added to the efficiency, whereas the estimate for modern format store is not part of the equation.

R2

Furthermore we can look at $R^2$, which SAS JMP has calculated to be 0.26. This indicates that the multiple regression equation has 26% less prediction error than using $\bar{y}$ (the sample mean of efficiency), which was found to equal 5.476. Additionally, we could use $R^2$ to look at how much it increases as variables are included.
$R^2$ cannot get smaller when a variable is added; it can only stay the same or increase.

b) Fulfillment of Model Assumptions

Model Assumptions:
The first assumption is that each explanatory variable has a straight-line relation with $\mu_y$, with the same slope for all combinations of values of the other predictors in the model. The second assumption states that the data must be gathered using randomization. The third assumption states that y must have a normal distribution with the same standard deviation at each combination of values of the predictors in the model.

The model assumptions are fairly well satisfied, since the standard errors are roughly equal, with the biggest difference being about 0.1. Furthermore, the sample is collected using randomization and the distributions are approximately normal.

Nonlinear effects
In order to test whether there are any nonlinear effects, we use SAS JMP (see appendix 10) to see if any of the terms have a polynomial tendency. From the output we can see that logSize has a polynomial tendency, since its p-value is lower than the significance level. Competition, on the other hand, has a linear tendency, since its p-value is much higher than the significance level. The nonlinear tendency of the logSize variable might lead to a violation of the model assumptions, but it cannot be concluded with certainty that the assumption has been violated.

Interactions
An interaction between two explanatory variables occurs when the slope of the relationship between $\mu_y$ and one of the explanatory variables changes as the other explanatory variable changes. In other words, there is an interaction when the effect of one variable on the response depends on the value of the other variable. We can test for interactions in SAS JMP (see appendix 11). If the p-value of a crossed term is below the significance level of 0.05, there is an interaction.
Thus there is an interaction between logSize and storetype, as the p-value of 0.0002 is below the significance level.

As the only interaction is between logSize and storetype, the additive model is not fully accurate for these two predictors, since the effect of one depends on the level of the other. To make the model more accurate, this would have to be accounted for, for example by including the interaction term.

c) Statistical Significance of Predictors

We could have added the parameters to the model stepwise, to see whether each parameter increases $R^2$ and thereby enhances the predictive power of the model. If $R^2$ increases noticeably when a parameter is added, the parameter contributes to the model. In contrast, if $R^2$ remains practically constant, the parameter adds little. We can obtain $R^2$ by using the following formula:
$R^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$
But we can also look at the effect tests in JMP. We set a significance level of 0.05 and conclude from the effect tests table that the city parameter is not statistically significant, since 0.5197 > 0.05. In contrast, storetype, logSize and competition are all statistically significant, since their p-values are below 0.05.

Confidence interval:
The 95% confidence interval for the effect of competition is given by the formula:
$\text{Estimated slope} \pm t_{0.025} (se)$
The t-score has df = n - (number of parameters in the regression equation), so df = 389 - 9 = 380. Thus the t-score is equal to 1.96, as seen from Table B on page A-3 in the book. We obtain the estimated slope from the “indicator parameterization” in JMP and find that b = 0.5250564.

Thus the 95% confidence interval for the effect of competition is (0.2334871, 0.8166256). So we can be 95% confident that increasing competition by 1 unit changes mean efficiency by at least 0.2334871 and at most 0.8166256, holding the other predictors fixed.
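The text reports the slope and its interval for competition but not the standard error. As a rough consistency check, the sketch below backs out the standard error implied by the reported interval. It assumes the interval was built as b ± 1.96·se, as the text describes; JMP may have used the exact t quantile for df = 380, in which case the true standard error is slightly smaller. The names `se_implied` and `t_ratio` are illustrative.

```python
# Values quoted in the text for the effect of competition.
b = 0.5250564
ci_low, ci_high = 0.2334871, 0.8166256

# Assumption: the multiplier is the 1.96 the text uses for df = 380.
t_crit = 1.96

# A symmetric interval is centred on the estimate, so the midpoint should equal b.
midpoint = (ci_low + ci_high) / 2
half_width = (ci_high - ci_low) / 2

# Standard error implied by the half-width, and the resulting t ratio for H0: beta = 0.
se_implied = half_width / t_crit
t_ratio = b / se_implied

print(f"midpoint = {midpoint:.7f}, implied se = {se_implied:.5f}, t = {t_ratio:.2f}")
```

A t ratio well above 1.96 agrees with the effect test, which found competition statistically significant at the 0.05 level.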
Question 7

a) Logistic Regression Model

We have fitted a logistic regression model in SAS JMP in order to predict computer usage based on logSize (see appendix 12). The estimates of the model are taken from the JMP output.

Odds ratio:
As seen from the unit odds ratio in the SAS JMP output, the probability of having a computer increases as the store size increases. The unit odds ratio is equal to 0.035066, and thus below 1. An odds ratio below 1 means that the odds of the modelled outcome decrease as logSize increases, so the conclusion that the chance of having a computer increases with store size holds if the outcome JMP models is the absence of computer use. The odds ratio of 0.035066 corresponds to a tenfold increase in size, seeing as size is measured on a base-10 log scale.

Confidence Interval
From the SAS JMP output we can see that the confidence interval for the odds ratio is (0.014086, 0.07909). This means that we can be 95% confident that the population odds ratio falls between 0.014086 and 0.07909. Since the whole interval lies below 1, the effect of logSize is statistically significant.

b) Multiple Regression Model

We will now use SAS JMP to construct a logistic regression with multiple explanatory variables. The logistic model with the predictors logSize, Perception, City and StoreType is given by the parameter estimates in the SAS JMP output.

c) Significance: Perception

We will describe the effect of perception both statistically and in the real world. From the JMP output it can be seen that Perception is statistically significant, as its p-value is below the significance level (0.0412 < 0.05). The fact that the value is statistically significant does not necessarily mean that there is a meaningful connection between perception and computer use. It is important to distinguish between statistical significance and real-world significance; e.g. it does not necessarily make sense that computer use is determined by how people perceive a store.
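The unit odds ratio reported in part a) can be re-expressed in a way that may be easier to interpret. The sketch below takes the reciprocal of the odds ratio and of its interval endpoints, which is the odds ratio for the opposite outcome level. Which level JMP modelled is an assumption here and should be checked against the output in the appendix.

```python
import math

# Unit odds ratio for logSize and its 95% interval, as quoted from the JMP output.
odds_ratio = 0.035066
ci_low, ci_high = 0.014086, 0.07909

# The logistic slope is the natural log of the unit odds ratio.
log_odds_slope = math.log(odds_ratio)

# Reciprocal odds ratio: the odds ratio for the opposite outcome level.
# Taking reciprocals swaps the roles of the interval endpoints.
inv_or = 1 / odds_ratio
inv_ci = (1 / ci_high, 1 / ci_low)

print(f"slope on the log-odds scale = {log_odds_slope:.4f}")
print(f"reciprocal odds ratio = {inv_or:.2f}, 95% CI = ({inv_ci[0]:.2f}, {inv_ci[1]:.2f})")
```

Read this way, a one-unit increase in logSize (a tenfold increase in store size) multiplies the odds of the opposite outcome by roughly 28.5.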
Appendix

Appendix 1 – Question 1
Appendix 2 – Question 2
Appendix 3 – Question 2
Appendix 4 – Question 2
Appendix 5 – Question 3
Appendix 6 – Question 3
Appendix 7 – Question 4
Appendix 8 – Question 5
Appendix 9 – Question 6
Appendix 10 – Question 6B
Appendix 11 – Question 6B
Appendix 12 – Question 6B
Appendix 13 – Question 7