STOCKHOLM UNIVERSITY
Department of Statistics
Fall 2022

Cover page: Hand-in Assignment 3, Basic Statistics for Economists
3. Econometrics

Assignments' teacher: Mona Sfaxi
Group (1-10): Seminar Group 3
Assignments' Group (1-15): Work Group 5
Data: Dataset 10

Group Members:
Name                     Date of birth   E-mail
Elvljung Tilda           00.05.13        tildaelvljung@gmail.com
Fredriksson Engla        01.05.01        fredrikssonengla@gmail.com
Fromholtz Levin Oscar    01.05.23        levin7607@gmail.com
Lexner Hampus            99.11.09        lexnerhampus@gmail.com

Part A: Regression Analysis

Problem 1
Create a correlation matrix, which includes all the numerical variables in the data set (five variables), and answer the following questions. Remember to include the correlation matrix in your report to support your arguments.

              Items sold   R-price     C-price     Ad cost     Price diff
Items sold    1
R-price       0,086806     1
C-price       0,717734     0,53493     1
Ad cost       0,596904     0,475578    0,829533    1
Price diff    0,793699     −0,00923    0,839923    0,676294    1

In this case the dependent variable is Items sold.

(A) Which independent variable has the highest absolute correlation with the dependent variable?
The independent variable with the highest absolute correlation with the dependent variable, Items sold, is Price difference, with a correlation of 0,793699. This indicates a strong positive linear relationship: as Price difference increases, Items sold tends to increase as well. Of all the independent variables, Price difference has the strongest linear association with Items sold.

Scatter plot:
The correlation coefficient is a measure of the strength and direction of the linear relationship between two variables. It ranges from −1 to 1, with a value of 0,793699 indicating a strong positive relationship.
This can be observed in the scatter plot, where an increase in Price difference is associated with an increase in Items sold.

(B) Which independent variable has the lowest absolute correlation with the dependent variable?
The independent variable with the lowest absolute correlation with the dependent variable (Items sold) is Retailer price. The correlation is 0,086806, which means there is almost no linear relationship between the two. Of all the independent variables, Retailer price is the one least related to Items sold, so if the Retailer price changes, Items sold is not likely to change systematically with it.

Scatter plot:
When the correlation is equal or close to 0, there is no or very little linear association between the two variables. With a correlation as low as 0,086806, the scatter plot shows that the two variables do not move together.

Problem 2
Choose the independent variable that you think will best explain the number of items sold and estimate a simple linear regression model. Include the Excel output of your regression model in the report.

We have chosen the independent variable Price difference. We think Price difference will best explain the number of items sold because it has the highest correlation with Items sold of all the independent variables.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0,787029937
R Square             0,619416122
Adjusted R Square    0,605320423
Standard Error       79,30234732
Observations         29

ANOVA
              df    SS             MS             F              Significance F
Regression    1     276355,4078    276355,4078    43,94362526    4,10713E-07
Residual      27    169799,2819    6288,862291
Total         28    446154,6897

                                             Coefficients    Standard Error    t Stat         P-value        Lower 95%      Upper 95%
Intercept (labelled "Items sold" in Excel)   1394,709248     21,75897469       64,09811436    4,82778E-31    1350,06352     1439,354976
Price Diff                                   71,28314376     10,75322922       6,628998813    4,10713E-07    49,21933989    93,34694763

A. According to Edgar Bueno during one of his lectures, an R^2 value of 0,7 or higher is considered good, so an R^2 value of 0,619 could be considered decent.
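The R^2 value is used to determine how well the model fits the data: it is the share of the variation in Items sold that the model explains, R^2 = 1 − SSE/SST = 1 − 169799,28/446154,69 ≈ 0,619. The adjusted R^2 also corrects for the number of independent variables k: R^2-adjusted = 1 − (SSE/(n − k − 1))/(SST/(n − 1)) = 1 − 6288,86/15934,10 ≈ 0,605.

B. The coefficient of Price difference indicates how much the number of items sold is expected to change when the price difference changes by one unit. From the Excel output above, the estimate is 71,28. If the price difference increases by three units, the expected number of items sold increases by about 3 · 71,28 ≈ 214 items. So for every one-unit increase in the price difference, we expect sales to increase by approximately 71 items.

C. We test the null hypothesis
H0: β1 = 0 (the regression coefficient is not significantly different from zero)
against the alternative hypothesis
H1: β1 ≠ 0 (the regression coefficient is significantly different from zero).

Critical value: t(n−2; α/2) = t(27; 0,025) = 2,052
Tobs = 71,28/10,75 = 6,63
Decision rule: Reject H0 if |Tobs| > t(n−2; α/2)

Since Tobs is larger than the critical value, we reject H0 in favour of the alternative hypothesis H1. The coefficient of Price difference is significantly different from zero at the 5% significance level.

D. 95% prediction interval for the number of items sold when the price difference is x = 3:
y_hat = B0 + B1 · x = 1395 + 71,3 · 3 = 1608,9
Interval: 1608,9 ± 168,49 → [1440,4 ; 1777,4]
The interval tells us where a single new observation is expected to fall. With 95% confidence, the number of items sold for a price difference of 3 is between 1440 and 1777.

E. We calculated the confidence interval for the expected number of items sold using the following formula:
y_hat ± t(n−2; α/2) · sqrt( Se2 · ( 1/n + (x − x_bar)^2 / ((n − 1) · Sx2) ) )

B0 = 1395
B1 = 71,3
x = 3
n = 30
x_bar = 1,45
t(n−2; 0,05/2) = 2,048
Se2 = 6288,9
Sx2 = 1,93

1608,9 ± 2,048 · sqrt( 6288,9 · ( 1/30 + (3 − 1,45)^2 / (29 · 1,93) ) ) = 1608,9 ± 44,9
Interval: [1564 ; 1654]

With these calculations we can say with 95% confidence that the expected value of y at x = 3 is between 1564 and 1654. These values use n = 30, whereas the Excel output reports 29 observations; with n = 29 the critical value would be t(27; 0,025) = 2,052 and the limits would shift slightly.

We see that the prediction interval in D is wider. That is because it has a bigger standard error: besides the uncertainty in the estimated regression line, it also includes the variation of an individual observation around the line, which adds the term 1 inside the bracket under the square root.

Problem 3:
a) The three variables that are linear combinations of each other are Retailer price, Competitor price and Price difference.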
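The linear dependence named in a) can be checked numerically. The sketch below is only an illustration under stated assumptions: the prices are made up (they are not the assignment data), the helper `matrix_rank` is a hypothetical name, and the sign convention for the difference is assumed (the dependence holds with either sign). It shows that a design matrix containing all three price variables has fewer independent columns than it has columns.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix by Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the column entries below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Made-up prices, not the assignment data.
retailer = [10, 12, 11, 13, 12]
competitor = [11, 12, 13, 15, 12]
diff = [c - r for c, r in zip(competitor, retailer)]  # assumed sign convention

# Design matrices with an intercept column of ones.
X_all = [[1, r, c, d] for r, c, d in zip(retailer, competitor, diff)]
X_two = [[1, r, d] for r, d in zip(retailer, diff)]

rank_all = matrix_rank(X_all)  # 4 columns, but only 3 are independent
rank_two = matrix_rank(X_two)  # 3 columns, all independent
```

With all three price variables the matrix has four columns but rank three, so least squares has no unique solution; dropping one of the three restores full column rank.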
b) All three cannot be used as independent variables in the same model because they are perfectly linearly dependent: Price difference = Competitor price − Retailer price (the sign follows from the correlation matrix, where Price difference is strongly positively correlated with Competitor price). With all three in the model, the regression coefficients cannot be estimated uniquely.

Problem 4:
Independent variables we used: Price difference, Ad cost and Special offers

SUMMARY OUTPUT

Regression Statistics
Multiple R           0,796527104
R Square             0,634455427
Adjusted R Square    0,590590078
Standard Error       80,76866364
Observations         29

ANOVA
              df    SS             MS             F              Significance F
Regression    3     283065,264     94355,088      14,46370413    1,14962E-05
Residual      25    163089,4256    6523,577026
Total         28    446154,6897

                  Coefficients    Standard Error    t Stat         P-value        Lower 95%       Upper 95%
Intercept         1287,396777     144,1559648       8,930582779    2,98519E-09    990,5020098     1584,291544
Ad Cost           0,018046207     0,025788143       0,699787019    0,490521269    −0,035065466    0,071157881
Price Diff        65,19600335     14,69719991       4,435947237    0,00016078     34,92655351     95,46545319
Special offers    26,21516721     33,76368995       0,776430753    0,444778988    −43,32245392    95,75278835

a) The R^2 value of the multiple regression model is R^2 = 0,634455427, which is decent. The model fits the data fairly well. We got a slightly higher R^2 value when using two more independent variables (0,634 compared to 0,619), which means that the three independent variables together explain slightly more of the variation in the dependent variable than Price difference alone. This is expected, partly because R^2 never decreases when independent variables are added and partly because the number of items sold is expected to depend on more than one variable.

An alternative to R^2 is the adjusted R^2. It is more commonly used when there is more than one independent variable. It adjusts for the number of variables and gives a more realistic estimate of the model's fit. Here the adjusted R^2 is 0,591, which is lower than the 0,605 of the simple regression, so the two extra variables add little explanatory power.

b) The regression coefficients describe the relationship between the independent variables and the dependent variable. The intercept, B0, is the estimated value of the dependent variable when all the independent variables are 0.
Intercept/B0 (1287,39677696526): When all the independent variables are equal to 0, the estimated number of items sold is B0 ≈ 1287.
B1 (0,018046207442899): When Ad cost increases by one unit and the other independent variables do not change, the estimated number of items sold changes by B1 ≈ 0,018.
B2 (65,1960033524042): When Price difference increases by one unit and the other independent variables do not change, the estimated number of items sold changes by B2 ≈ 65,2.
B3 (26,2151672123754): When Special offers increases by one unit and the other independent variables do not change, the estimated number of items sold changes by B3 ≈ 26,2.

c) The 95% confidence intervals for Ad cost [−0,035 ; 0,071] and Special offers [−43,32 ; 95,75] contain the value 0, so these coefficients are not significantly different from zero at the 5% level. The intervals for the intercept [990,50 ; 1584,29] and Price difference [34,93 ; 95,47] do not contain 0.

d) Price diff = 3
Special offers = 1
Ad cost = 2500
Formula used: Yhat = B0 + B1 · Ad cost + B2 · Price diff + B3 · Special offers
Yhat = 1287,40 + 0,018046 · 2500 + 65,196 · 3 + 26,215 · 1 ≈ 1554,3

Problem 5:
Other independent variables that could have been included, from a business administration perspective, are for example:

Location of the retailer:
Pros: If a retailer in a big city sells less than one in a small city, it is probably not using its full sales potential given its number of potential customers.
Cons: The information might be hard to find and hard to quantify, so the variable might just make the study harder.

Median income in the municipality:
Pros: The median income says something about the living standard where the retailer is located, and a higher living standard usually correlates with people buying more.
Cons: This information might also be hard to find. It is somewhat easier to work with than location because it is a numeric variable, but it could still make the study harder.

Number of items sold is a relevant dependent variable. It shows which of the retailers sells the most and therefore probably does best from a business perspective. Another dependent variable that probably would be better than the number of items sold is profit.
If we used profit, we could see more precisely which retailer does best from a business perspective. For instance, a retailer may sell more items but at a lower price than a competitor. The competitor could still do better from a business perspective, because it does not have to sell as much when its price is higher.

Part B: Time-series analysis

1) Brief description of the data
We chose "Car sales in Quebec". The variables are time and car sales, with monthly data from 1965-01 until 1968-12. Since the range covers four years of 12 months each, the series is 48 months long.

2) Characteristics of the time series
It is clear that the number of cars sold varies seasonally every year. At the beginning of each year, around January and February, and around August and September, the number of cars sold is lowest. The lowest number of units sold in the whole time range occurs in 1965-07, when 10895 cars were sold. After these seasonal lows the number of units sold increases sharply around April and tends to be highest in May each year. The highest number of units sold in the whole time range occurs in 1968-05, with 26099 units sold.

Besides the seasonal variation, the differences in units sold from year to year can be explained by factors such as price, marketing, economic conditions and competition. The effects of different components can be described by additive and multiplicative models. In an additive model the components (for a time series: trend, seasonal and irregular) are added together to give the observed value. In a multiplicative model the components are multiplied instead. It is difficult to determine whether the time series follows an additive or a multiplicative model without further information.

3) Do the time series' properties seem reasonable considering what the series describes?
The time series' properties seem logical, since many people may want to buy a new car before the summer, for instance to be able to make summer trips. Therefore the most cars are sold around May and June. It is also reasonable that people do not find it appropriate to buy a new car at the beginning of the year, since many people have a lot of expenses during December and January. As already mentioned, other factors certainly affect the number of units sold as well.

4) Seasonally adjusted monthly data
We chose to seasonally adjust the monthly data. The goal of seasonal adjustment is to remove the seasonal component from the series. We do this to make the data more representative of the underlying economic conditions and trends. It makes it easier to compare data from different seasons, and also to compare data from the same season in different years.

What we can see in our time series chart is that, as said before, there is a clear seasonal component: car sales go up a lot from late spring to the beginning of summer and then go down in the middle of summer. We can also see a slight upturn in car sales in the autumn. The seasonally adjusted data do not show increases and decreases as extreme as the original data, which implies that the season has a lot to do with how many cars are sold in Quebec. The red line (the seasonally adjusted series) is an estimate of what car sales would have been if the seasons had no effect, so we can assume that its remaining fluctuations depend on economic conditions and trends at the time.
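The adjustment described above can be sketched in Python. This is a minimal illustration, not the method used for the chart: the report does not state which procedure or software produced the red line, so the sketch assumes an additive model and a classical decomposition with a centered 12-month moving average. The function names are hypothetical and the numbers are made up (they are not the Quebec data).

```python
from statistics import mean

def centered_moving_average(series, period=12):
    """2 x `period` centered moving average; None where the window does not fit."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]
        # The two end points get half weight so the window is centered on month t.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend

def seasonal_adjust_additive(series, period=12):
    """Return (seasonal_indices, adjusted_series) under an additive model."""
    trend = centered_moving_average(series, period)
    # Collect the detrended values separately for each month of the year.
    by_month = [[] for _ in range(period)]
    for t, (y, m) in enumerate(zip(series, trend)):
        if m is not None:
            by_month[t % period].append(y - m)
    raw = [mean(values) for values in by_month]
    # Normalize so the seasonal indices sum to zero over one year.
    offset = mean(raw)
    seasonal = [s - offset for s in raw]
    # Subtract each month's seasonal index from the observed value.
    adjusted = [y - seasonal[t % period] for t, y in enumerate(series)]
    return seasonal, adjusted

# Made-up example: 48 months of linear trend plus a fixed monthly pattern.
pattern = [-30, -25, 5, 20, 40, 25, -10, -20, -15, 5, 10, -5]  # sums to zero
trend_true = [1000 + 5 * t for t in range(48)]
sales = [trend_true[t] + pattern[t % 12] for t in range(48)]

seasonal, adjusted = seasonal_adjust_additive(sales)
```

In this constructed example the adjusted series coincides with the underlying trend, because the seasonal pattern is identical every year. With real data the adjusted series still contains the irregular component, which is why the adjusted line in a chart keeps fluctuating. If the seasonal swings grew with the level of the series, a multiplicative version (dividing by seasonal factors instead of subtracting indices) would be the more natural choice.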