STOCKHOLM UNIVERSITY Department of Statistics Fall 2022 Cover page: Hand-in Assignment 3, Basic Statistics for Economists 3. Econometrics Assignments’ teacher: Group (1-10): Assignments’ Group (1-15): Data: Part 3A: Data 7 (from Assignment 1) Part 3B: Data 1, Beer production Australia Note! Always save your own version of the report Group Members: Name: Date of birth: E-mail Result after first deadline: Pass Fail Comments: Results after the second deadline: Pass Comments: Fail Part A Regression analysis 1. a) The highest absolute correlation with the dependent variable (the number of items sold) is the independent variable called “Price difference”, which is 0,884926419. This is shown in a scatter plot diagram below. In the scatter plot, the price difference is represented by the x-axis and the number of items sold is represented by the y-axis. The number of items sold increases when the price difference itself increases and this result is obtained when the competitor's price is subtracted from the retailer's price. One can also see that the correlation coefficient is close to 1, which can be an indication that it is a strong positive linear relationship. b) The lowest absolute correlation with the dependent variable (the number of items sold) is the independent variable called “Retailer’s price”, which is -0,079295618. This is shown in a scatter plot diagram below. Just as before, in the scatter plot, the retailer’s price is represented by the x-axis and the number of items sold is represented by the y-axis. In this diagram, one can see that the correlation coefficient is 1 close to “0”, which indicates no correlation. This means there is no linear relationship between the two variables. In other words, the variables do not depend on each other and therefore, the linear relationship in this scatter plot is almost non-existent. 2. Note that: The chosen independent variable: Price difference The chosen dependent variable: Quantity - number of items sold a) In the simple linear regression model above, the coefficient of determination (R2) = 0,783094767. This could be an explanation of the variation from the price difference, which in turn means that approximately 78,31% of the variance in the dependent variable can be explained by the independent variable. A larger value of R 2 could indicate a better regression model, and 78,3% could be considered a high value (Newbold et al. 2020, p.439). It is possible to calculate the R2 value without estimating the regression model. It can be calculated with the help of the formulas below: (Formula sheet, p.3) b) The linear equation used in the regression model: . The quantity numbers of items sold will have the intercept coefficient Bo when the Price difference = 0. This in turn means that the quantity - number of items sold will have an average value of 1435,43 when the price difference is 0. Additionally, the regression coefficient B↿ is an indication of the slope of the regression line. So when the coefficient increases with 1 unit, the quantity - number of items sold will have an average increase of 69,63 (Newbold et al. 2020, p.421-423). 2 c) The testing of the regression coefficient is done through a formal hypothesis test. Note that this test is a two-sided-test which means that the absolute value can be used in the testing and it also gives a possibility for the slope (slope = B↿) to vary from 0 in different ways (positive vs negative or both). Also, the significance level for testing is at a 5% level (alpha = 0,05), divided in two as it is a two-sided test. That gives a significance level testing of 5%/2 = 2,5% (alpha = 0,025). Observations: n = 30 Hypotheses: The null hypothesis formulated as: H૦ : B↿= 0 H↿ : B↿ ≠ 1 (Newbold et al, 2020, p.351-354). Test Statistic: b1:69,62950823, β1*:0, Sb1= 6,925357, K:1 (regression), t(n-K-1): (b1- β1*)/Sb1 Critical Value T (n - 2,α/2) = t (28, 0,025) = 2,048 (from Table 3 in the formula sheet with alpha 0,025 and 28 degrees of freedom). T (30-2, α/2 = t (28,0,025) = 2,048 Decision Rule b1 = 69,62950823, β1* = 0, Sb1= 6,925357, T(n-K-1) = 30-1-1 = 28 If T obs; t(n - K -1; 0,025) is > T critical value; (T (n - 2,α/2) then a rejection of the null hypothesis can be made. T obs (28) = 69,62950823 / 6,925357 = 10,0542843 ≈ 10,05 (28, according to table 3 in the formula sheet) Conclusion: The null hypothesis can be rejected since the observed value (= 10,05) is higher than the critical value (= 2,048). One explanation of this is that the “Price Difference” is correlated with the number of “ quantity - number of items sold” at the level of significance of 5%. d) b0: 1435,42731 (from regression model) b1: 69,62950823 (from regression model) b1X: 69,62950823*3 = 1644,3158 T(n - 2,α/2) = t(28,0.025) = 2,048 n : 30 K:1 xbar: =()Mean price difference in Excel = 1,24 s2x: =()SVariance price difference in Excel = 3,09 X = Competitor’s price – Retailer's price = 31 SEK – 28 SEK = 3 SEK = 120180,682/30-1-1= 4292,1672 3 = 4292,1672*((1+1/30)+ ((3-1,24)^2)/((30-1)*3,09)))) = (1435,42731+69,62950823*3) ± 2,048√4292,1672*((1+1/30)+ ((3-1,24)^2)/((30-1)*3,09)))) = 1644,3158 ±138,6544285 X1 = 1644,3158 + 138,6544285 = 1782,9703 X2 = 1644,3158 – 138,6544285 = 1505,6614 Conclusion: The interval of the values is between 1782,9703 and 1505,6614 which indicates a result of a 95% prediction level given that x = 3. e) Calculation: b0: 1435,42731 (from regression model) b1: 69,62950823 (from regression model) b1X: 69,62950823*3 = 1644,3158 T(n - 2,α/2) = t(28,0.025) = 2,048 n : 30 K:1 xbar: =()Mean price difference in excel = 1,24 s2x: =()SVariance price difference in excel = 3,09 X = Competitor’s price – Retailer's price = 31 SEK – 28 SEK = 3 SEK = 120180,682/(30-1-1)= 4292,1672 = 4292,1672 ( (1/30)+ ((3-1,24)^2)/((30-1)*3,09) = 291,07269 = (1435,42731+69,62950823*3) ± 2,048√4292,1672*((1/30)+ ((3-1,24)^2)/((30-1)*3,09)))) = 1644,3158 ± 34,9627872 - X1 = 1644,3158 + 34,9627872 = 1679,2786 - X2 = 1644,3158 - 34,9627872 = 1609,3530 Conclusion: The interval of the values between 1679,2786 and 1609,3530 indicates a result of a 95% confidence level given that x = 3. 3 a) The three variables in the data set that are linear combinations of each other are: Competitor’s price, retailer’s price and price difference. The price difference is computed through: Competitor’s price minus the retailer’s price. 4 b) The three variables above should not be used as independent variables in the same model because of a multicollinearity that could arise, which occurs when variables like these are correlated. Therefore, a model containing variables that are closely correlated to one another would be misleading as it could weaken the significance of the statistics and give rise to incorrect calculations. There is also a chance that the standard errors could be of a higher level, due to the multicollinearity that could arise when using datasets who are linear combinations of each other (Newbold et al. 2020, p. 578-580) 4. For this question, the same dataset as previously is used. The dependent variable is the number of items sold and the three independent variables are retailers prices, price difference and advertising costs. Fig 4.1 a) i) The coefficient of determination (R2) is approximately 0,8477 which in other words means that 84.77% of the dependent variable can be explained by the chosen independent variables. In addition, the high coefficient of determination value shows that the model is a good fit for the data. ii) The coefficient of determination of the multiple regression model (R2 = 0.8477) is higher than that for the linear regression model (R2 = 0.7831). Compared to question 2b, there was an R2 of 78.31% which was lower. The higher R2 in the second model could be explained by the fact that there are more independent variables and thus more data to explain the dependent variable, and without analyzing the data, this should be expected. Nevertheless, more variables are not necessarily better. There could be a situation in which more variables means a lower coefficient of determination where the independent variable does not contribute to forecasting the dependent variable. iii) An alternative to the coefficient of determination, is the adjusted R Squared ( ) which in this case is a similar measure but adjusted for adding more variables and more data to the dataset. It aims to eliminate false correlations between the variables so that it will take into account the true correlation. It is therefore fitting in this case for multiple regression where there are multiple variables. 5 b) The coefficient for the retailer's price is 46.13. This indicates that every 1 unit increase in retailer’s price will lead to 46.13 units increase in the sold items if both advertising costs and price difference are held constant. The coefficient for advertising cost is -0.042. This means that every 1 unit increase in advertising costs will lead to a 0.04 units decrease in the sold items if both retailer’s price and price difference are held constant. The coefficient for price difference is 91.87. This indicates that every 1 unit increase in the price difference will lead to 91.87 units increase in the sold items if both advertising costs and retailer’s price are held constant. The intercept value is equal to 420.68 and this means that the predicted value of items sold is 420.68 if all independent variables within this model are equal to zero. This means that when advertising cost as well as retailer’s price and price difference equal zero. c) Yes, the 95% confidence interval for the advertising costs coefficient contains the value zero as the values are -0.099 and 0.016. This means that were the experiment to be run again, there would be a high chance of finding no correlation in the data. Which, in other words, implies that the actual coefficient value can be zero, meaning that the predictor has no relationship with the dependent variable. d) Fig 4.2 Number of Items sold: (28*46.12989498)+(3*91.87085393)+(2500*-0.041838779)+420.6837777=1883.33645143 =1883.34 Given the assumed values for the independent variables, the output for the dependent variable is approximately 1883. The number of items sold will therefore be 1883. 5) A company is most interested in earnings thus if one wishes to analyze a company from a business analytical perspective this is what should be the central focus. The figure which is to be examined in the model is items sold since there is a relationship between the number of items sold and earnings. However, one ought to analyze earnings, as it does not matter how many items are sold when it does not impact the business's earnings. Thus sales and earnings would be of most interest. Earnings are the product of the number of sold items and the price which the items are sold for adjusted for the acquisition cost and overheads. So the core of the analysis should be the relationship between these items. The model as of now is focusing on sold items as the core. The optimal analysis would plot earnings against sold items, selling price and related costs. In this case, it is assumed that profitability is the ultimate goal. If one breaks down decision-making in strategic parts, there might be subgoals that contribute 6 towards profitability, but in themselves do not express profit. The point made is that one should analyze the variables that contribute to the end goal which is higher earnings. This can also be expressed as follows : if one considers the dependent variable the end goal, then one should identify independent variables which contribute to a high R2. In other words, one wants to use independent variables that contribute well to plotting the dependent one. In a business situation, this would also have the added benefit of identifying divers for earnings if one uses that as the dependent variable. It should also be taken into account that one might not know all the variables which are important in our analysis. If one were to omit an important value, the relationship between the others would not correspond well to reality and the outputs of our model would be misleading (Newbold et al. 2020, p. 575). If earnings were to be forecasted without using the price of sold items, the relationships, if one would analyze it, would only take into account sold items and related costs. This could forecast earnings to some extent if one assumes an increase in sold items by itself leads to higher earnings and in the same way lower cost means higher earnings. The fact is, however, that if the price decreased this would change the situation and not guarantee the assumptions just mentioned. This demonstrates that the model loses accuracy if one omits an important independent variable and the impact on accuracy depends on the significance of the variable. To conclude, it can be said that when one variable were to be analyzed there would only be one relationship. Profit is dependent on the amount of goods sold, price and cost associated, and if you sell more, higher profit can be assumed. If the equation for earnings is sold items by price, then a higher price would mean higher earnings and it is also important to take into consideration that lower costs would also result in higher earnings. That, however, would assume that the other variables are unmoving. As posed in the question, management has asked for the usefulness of the model to be discussed. When discussing the model with management it ought to be argued that while one can see strong advantages with the given model based on the “number of items sold” it would be appropriate to investigate further into other variables that lead to higher profitability, too. There could still be bias if not investigated properly, however, including more variables would allow for a more conclusive picture to be formed. 7 Part B 1. The data set chosen is quarterly beer production in Australia and it is measured in mega liters. In the diagram above, time is on the x-axis where each quarter of each year is shown and the total beer production is on the y-axis. The data set and time-series diagram have in total 48 points and it spans over twelve years from January 1957 until December 1968. Each point on the diagram corresponds to the quarterly beer production in Australia. 2. A positive upward trend and seasonal variations are present in the time-series graph. On average, the beer production is increasing over time during the relevant time period. In the data, there are seasonal variations present which are due to fluctuations in the beer production. There is a pattern of recurring highs and lows each year. In every fourth quarter, the manufacturing of beer increases rapidly which might be due to the fact that it is summer in Australia. In the first quarter, the production goes down slightly and in the second quarter, the production continues to decrease. There are no outliers present in the diagram. No cyclical components or other major irregularities can be observed in the data. The model is additive since the seasonal variations are around the same magnitude during the relevant time span. 3. The time series data is reasonable since beer consumption varies depending on the season. Beer manufacturers might forecast and plan how much beer will be consumed and change the production of it accordingly. The beer companies know that the demand will increase during summer and they prepare for this by increasing production. In the data, it is shown that the beer production increases rapidly every fourth quarter which means there is a strong seasonal component to it. This is recurring again and again in the data during the investigated 12 years. Beer consumption depends on the season and it varies depending on what time of the year it is. In the time series, the production has therefore increased and decreased over time which can be very much expected. 8 4. The Excel file “Beer production Australia” contains quarterly data and it is therefore suitable to use the Excel sheet “Seasonally adjusted quarterly”. The blue line in the graph above represents the original data while the orange line is the trendline and it is seasonally adjusted. The trendline is trending upwards which means that the total beer production has increased during the past 12 years investigated. In the graph, it is clearly shown that the difference between the original values and the adjusted values are relatively small. The seasonally adjusted series varies much less than the original series depending on which quarter it is. The seasonal component is not as clear in the smoothed data compared to the original data and the purpose of doing a seasonal adjustment of the data is to gain greater understanding of the overall trend and make it more useful. For example, the trendline can be used to forecast future production of beer which enables better planning of the business (Newbold et al. 2020, p. 693-701). 9 References Newbold et al. (2020). Statistics for Business and Economics (9th edition). United Kingdom: Pearson Education. 10