Uploaded by Oscar Fromholtz Levin

Asg3grp03wrk05

STOCKHOLM UNIVERSITY
Department of Statistics
Fall 2022
Cover page: Hand-in Assignment 3, Basic Statistics for Economists
3. Econometrics
Assignments’ teacher:
Mona Sfaxi
Group (1-10):
Seminar Group 3
Assignments’ Group (1-15):
Work Group 5
Data:
Dataset 10
Note! Always save your own version of the report
Group Members:
Name                     Date of birth   E-mail
Elvljung Tilda           00.05.13        tildaelvljung@gmail.com
Fredriksson Engla        01.05.01        fredrikssonengla@gmail.com
Fromholtz Levin Oscar    01.05.23        levin7607@gmail.com
Lexner Hampus            99.11.09        lexnerhampus@gmail.com
Result after first deadline: □ Pass  □ Fail
Comments:
Results after the second deadline: □ Pass  □ Fail
Comments:
Part A: Regression Analysis
Problem 1
Create a correlation matrix, which includes all the numerical variables in the data set (five
variables), and answer the following questions. Remember to include the correlation matrix in your
report to support your arguments.
              Items sold   R-price    C-price    Ad cost    Price diff
Items sold    1
R-price       0,086806     1
C-price       0,717734     0,53493    1
Ad cost       0,596904     0,475578   0,829533   1
Price diff    0,793699     −0,00923   0,839923   0,676294   1

In this case the dependent variable is items sold.
(A) Which independent variable has the highest absolute correlation with the dependent variable?
The independent variable with the highest absolute correlation with the dependent variable, "Items Sold," is "Price
Difference," with a correlation of 0,793699. This indicates a strong positive linear relationship between the two
variables, meaning that as "Price Difference" increases, "Items Sold" tends to increase as well. Of all the
independent variables, "Price Difference" has the strongest linear relationship with "Items Sold."
Scatter plot: The correlation coefficient is a measure of the strength and direction of the linear relationship
between two variables. It ranges from −1 to 1, with a value of 0,793699 indicating a strong positive
relationship. This can be observed in a scatter plot, where an increase in "Price Difference" is associated with
an increase in "Items Sold."
(B) Which independent variable has the lowest absolute correlation with the dependent variable?
The independent variable that has the lowest absolute correlation with the dependent variable (items sold) is
Retailer price. The correlation is 0,086806, which means there is almost no linear relationship between them. Out
of all the independent variables, Retailer price relates the least to items sold. So if the Retailer price
changes, items sold is not expected to change in any systematic way.
Scatter plot: When the correlation is equal or close to 0, there is little or no linear association between the
two variables. With a correlation as low as 0,086806, the points in the scatter plot show no clear upward or
downward pattern.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0,787029937
R Square             0,619416122
Adjusted R Square    0,605320423
Standard Error       79,30234732
Observations         29

ANOVA
              df    SS             MS             F              Significance F
Regression    1     276355,4078    276355,4078    43,94362526    4,10713E-07
Residual      27    169799,2819    6288,862291
Total         28    446154,6897

                          Coefficients   Standard Error   t Stat         P-value        Lower 95%      Upper 95%
Items sold (intercept)    1394,709248    21,75897469      64,09811436    4,82778E-31    1350,06352     1439,354976
Price Diff                71,28314376    10,75322922      6,628998813    4,10713E-07    49,21933989    93,34694763
Problem 2
Choose the independent variable that you think will best explain the number of items sold and
estimate a simple linear regression model. Include the Excel output of your regression model in the
report.
We have chosen the independent variable Price Difference. We think Price Difference will explain the
number of items sold because it has the highest correlation out of all the independent variables.
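A. According to Edgar Bueno during one of his lectures, an R^2 value of 0,7 or higher is considered to be
good, so an R^2 value of 0,619 can be considered decent. The R^2 value is used to determine how well the
model fits the data: here about 62% of the variation in items sold is explained by Price Difference. It is
calculated as
R^2 = 1 − SSE/SST = 1 − 169799,28/446154,69 ≈ 0,619.
The adjusted R^2 also corrects for the degrees of freedom:
R^2-adjusted = 1 − (SSE/(n − k − 1))/(SST/(n − 1)), which gives 0,605 with n = 29 and k = 1.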
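The goodness-of-fit figures can be checked from the ANOVA table with a few lines of Python. This is only a sketch: the function names are our own, and the inputs are the SS values, number of observations and number of regressors copied from the Excel output above.

```python
def r_squared(sse, sst):
    """Share of the total variation explained by the model."""
    return 1.0 - sse / sst


def adjusted_r_squared(sse, sst, n, k):
    """R^2 corrected for the degrees of freedom (n observations, k regressors)."""
    return 1.0 - (sse / (n - k - 1)) / (sst / (n - 1))


# Values copied from the ANOVA table of the simple regression.
sse = 169799.2819   # residual sum of squares
sst = 446154.6897   # total sum of squares
n, k = 29, 1        # observations and number of independent variables

r2 = r_squared(sse, sst)
adj_r2 = adjusted_r_squared(sse, sst, n, k)
print(f"R^2 = {r2:.6f}, adjusted R^2 = {adj_r2:.6f}")
```

Both values agree with the "R Square" and "Adjusted R Square" rows of the Excel output.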
B. The coefficient of "price difference" indicates how much the number of items sold is expected to change
when the price difference changes by one unit. The Excel output above gives a coefficient of 71,28, so if the
price difference increases by three units, the expected number of items sold increases by about 214
(3 × 71,28 ≈ 213,8). Equivalently, 213,8/3 ≈ 71. So every time the price difference increases by
one unit, we expect the sales to increase by approximately 71 items.
C. We test the null hypothesis
H0: β1 = 0 (the regression coefficient is not significantly different from zero)
against the alternative hypothesis
H1: β1 ≠ 0 (the regression coefficient is significantly different from zero).
Critical value: t(n−2; α/2) ≈ 2,05
Tobs = 71,3/10,75 = 6,63
Decision rule:
Reject H0 if |Tobs| > t(n−2; α/2)
Since Tobs is larger than the critical value, we reject H0 in favour of the alternative hypothesis H1.
This tells us that the coefficient of Price Difference is significantly different from zero at the 5%
significance level.
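The test can be reproduced with a short Python sketch. The coefficient and its standard error are copied from the Excel output. The critical value 2,0518 is an assumption on our part: it is the two-sided 5% value of the t distribution with the 27 residual degrees of freedom shown in the ANOVA table.

```python
# Slope estimate and standard error from the Excel output.
b1 = 71.28314376
se_b1 = 10.75322922

# Assumption: two-sided 5% critical value of t with 27 degrees of freedom.
t_crit = 2.0518

t_obs = b1 / se_b1                # test statistic under H0: beta1 = 0
reject_h0 = abs(t_obs) > t_crit   # two-sided decision rule

# The same critical value gives the 95% confidence interval for the slope.
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1

print(f"t_obs = {t_obs:.2f}, reject H0: {reject_h0}")
print(f"95% CI for the slope: [{lower:.2f}, {upper:.2f}]")
```

The interval matches the "Lower 95%" and "Upper 95%" columns for Price Diff up to rounding, and it does not contain 0, which is the same conclusion as the t test.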
D.
Interval:
1608,9 ± 168,49 → [1440,4 : 1777,4]
This is the prediction interval for a single new observation of items sold when the price difference is 3.
With 95% confidence, such an observation lies between 1440 and 1777.
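E. We calculated the confidence interval using the following formula:
y_hat ± t(n−2; α/2) · sqrt(Se2 · (1/n + (x − x_bar)^2 / ((n − 1) · Sx2)))
B0 = 1395
B1 = 71,3
x = 3
n = 30
x_bar = 1,45
t(n−2; α/2) with α = 0,05: 2,048
Se2 = 6288,9
Sx2 = 1,93
y_hat = B0 + B1 · x = 1608,9
Interval:
1608,9 ± 44,85 → [1564 : 1654]
With these calculations we can say with 95% confidence that the expected value of y at x = 3 is between
1564 and 1654. The prediction interval in D is wider because it also accounts for the variation of an
individual observation around the regression line, which gives a larger standard error.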
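The interval calculations in D and E can be reproduced with the sketch below. It assumes the standard formulas for simple linear regression and uses the rounded inputs listed in E; the variable names are our own.

```python
import math

# Rounded inputs as listed in part E.
b0, b1 = 1395.0, 71.3   # intercept and slope
x0 = 3.0                # price difference at which we predict
n = 30                  # number of observations used in the hand calculation
x_bar = 1.45            # sample mean of the price difference
t_crit = 2.048          # t(n-2; 0.025)
se2 = 6288.9            # residual variance (MS Residual)
sx2 = 1.93              # sample variance of the price difference

y_hat = b0 + b1 * x0

# Term shared by both intervals: 1/n + (x0 - x_bar)^2 / ((n - 1) * sx2)
leverage = 1.0 / n + (x0 - x_bar) ** 2 / ((n - 1) * sx2)

# Confidence interval for the mean response (part E).
ci_half = t_crit * math.sqrt(se2 * leverage)
# Prediction interval for a single new observation (part D): note the extra "1 +".
pi_half = t_crit * math.sqrt(se2 * (1.0 + leverage))

print(f"y_hat = {y_hat:.1f}")
print(f"confidence interval: [{y_hat - ci_half:.0f}, {y_hat + ci_half:.0f}]")
print(f"prediction interval: [{y_hat - pi_half:.1f}, {y_hat + pi_half:.1f}]")
```

The only difference between the two half-widths is the extra 1 under the square root, which is why the prediction interval is always the wider of the two.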
Problem 3:
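a) The three variables that are linear combinations of each other are Retailer price, Competitor price and
Price difference.
b) All three cannot be used as independent variables in the same model because they are exactly linearly
dependent: Price difference = Competitor price − Retailer price, which matches the positive correlation
between Price diff and C-price in the correlation matrix. With such perfect multicollinearity the regression
coefficients cannot be estimated uniquely.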
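Why the model breaks down can be illustrated with a small Python sketch. The prices below are hypothetical numbers invented for the illustration, not values from Dataset 10, and the intercept is left out to keep the matrix at 3 × 3. When one column is an exact linear combination of the others, the matrix X'X used in least squares has determinant zero and cannot be inverted.

```python
# Hypothetical prices, invented purely for illustration (not the report's data).
retailer = [10, 12, 11, 13, 9]
competitor = [12, 13, 15, 14, 12]
# Third regressor built as an exact linear combination of the first two.
diff = [c - r for r, c in zip(retailer, competitor)]


def gram(columns):
    """X'X for a design matrix given as a list of columns."""
    return [[sum(a * b for a, b in zip(u, v)) for v in columns] for u in columns]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


det_all_three = det3(gram([retailer, competitor, diff]))
print("det(X'X) with all three price variables:", det_all_three)
```

Because the inputs are integers, the determinant is exactly zero. Dropping any one of the three price variables removes the problem.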
Problem 4: Independent variables we used: Price difference, Ad cost and Special offers
SUMMARY OUTPUT

Regression Statistics
Multiple R           0,796527104
R Square             0,634455427
Adjusted R Square    0,590590078
Standard Error       80,76866364
Observations         29

ANOVA
              df    SS             MS             F              Significance F
Regression    3     283065,264     94355,088      14,46370413    1,14962E-05
Residual      25    163089,4256    6523,577026
Total         28    446154,6897

                  Coefficients   Standard Error   t Stat         P-value        Lower 95%       Upper 95%
Intercept         1287,396777    144,1559648      8,930582779    2,98519E-09    990,5020098     1584,291544
Ad Cost           0,018046207    0,025788143      0,699787019    0,490521269    −0,035065466    0,071157881
Price Diff        65,19600335    14,69719991      4,435947237    0,00016078     34,92655351     95,46545319
Special offers    26,21516721    33,76368995      0,776430753    0,444778988    −43,32245392    95,75278835

a)
The R^2 value of the multiple regression model is 0,634455427, which is decent. The model fits the data
fairly well. We got a slightly higher R^2 value when using two more independent variables, which means
that the three independent variables together explain slightly more of the variation in the dependent
variable than Price difference alone. This result is expected, because R^2 can never decrease when
independent variables are added, and the number of items sold is expected to depend on more than one
variable.
An alternative to the R^2 is the adjusted R^2. It is more commonly used when having more than one
independent variable.
It adjusts for the number of variables and gives a more realistic estimate of the model's fit. Here the
adjusted R^2 is 0,591, which is lower than the 0,605 of the simple model. This suggests that Ad cost and
Special offers add little explanatory power.
b)
The regression coefficients describe the relationship between the independent variables and the
dependent variable. The intercept b0 is the estimated value of the dependent variable when
all the independent variables are 0.
b0 (intercept): 1287,39677696526
When all the independent variables are equal to 0, the estimated number of items sold is b0.
b1 (Ad cost): 0,018046207442899
When Ad cost increases by one unit and the other independent variables are held constant, the number
of items sold is expected to change by b1.
b2 (Price difference): 65,1960033524042
When Price difference increases by one unit and the other independent variables are held constant, the
number of items sold is expected to change by b2.
b3 (Special offers): 26,2151672123754
When Special offers increases by one unit and the other independent variables are held constant, the
number of items sold is expected to change by b3.
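c)
The 95% confidence intervals for Ad cost (−0,035 to 0,071) and Special offers (−43,32 to 95,75) contain
the value 0, so these two coefficients are not significantly different from zero at the 5% level. The
intervals for the intercept and Price difference do not contain 0.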
d) Price diff = 3
Special offer = 1
Ad cost = 2500
Formula used: Yhat = b0 + b1 * Ad cost + b2 * Price diff + b3 * Special offers
Yhat = 1287,40 + 0,018046 * 2500 + 65,196 * 3 + 26,215 * 1 = 1554,3
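The point prediction can be checked with a few lines of Python. The coefficients are copied from the Excel output of the multiple regression; the dictionary layout and variable names are our own choice for the sketch.

```python
# Coefficients from the multiple regression output.
coefficients = {
    "intercept": 1287.396777,
    "ad_cost": 0.018046207,
    "price_diff": 65.19600335,
    "special_offers": 26.21516721,
}

# Values of the independent variables given in part d).
inputs = {"ad_cost": 2500, "price_diff": 3, "special_offers": 1}

# Yhat = b0 + b1 * Ad cost + b2 * Price diff + b3 * Special offers
y_hat = coefficients["intercept"] + sum(
    coefficients[name] * value for name, value in inputs.items()
)
print(f"Predicted items sold: {y_hat:.1f}")
```

With these inputs the sum of the four terms is about 1554 items.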
Problem 5:
Other independent variables that could have been included, from a business administration perspective,
are for example:
Location of the retailer:
Pros: If a retailer in a big city sells less than one in a small city, it is probably not using its full
sales potential given its number of potential customers.
Cons: The information might be hard to find and hard to quantify, so the variable might just make
the study harder.
Median income in the municipality:
Pros: The median income says something about the living standard in the place where the retailer is
located, and a higher living standard usually correlates with people buying more.
Cons: This information might also be hard to find. It is easier to handle than the location, since it is
a numerical variable, but it could still be difficult to collect and make the study harder.
Number of items sold is a relevant dependent variable. It shows which of the retailers sell the most
and therefore probably do the best from a business perspective.
Another dependent variable that would probably be better than the number of items sold is profit. It
would show more precisely which retailer does the best from a business perspective. For instance, if a
retailer sells more items but at a lower price than the competitor, the competitor could still do better
from a business perspective, because it does not have to sell as much when its price is higher.
Part B: Time-series analysis
1)
Brief description of the data
We chose "Car sales in Quebec". The variables are time and car sales, with monthly data from
1965-01 to 1968-12. Since this range covers four years of 12 months each, the series
is 48 months long.
2)
Characteristics of the time series
It is clear that the number of cars sold varies seasonally every year. At the beginning of each year,
around January and February, and around August and September, the number of cars sold is the
lowest. The lowest number of sold units in the whole time range occurs in 1965-07, when 10895
cars were sold. After these dips the number of units increases a lot around April and
tends to be highest during May each year. The highest number of sold units in the whole time range
occurs in 1968-05, with 26099 sold units.
Besides the seasonal variation, the differences in units sold from year to year can be
explained by factors such as price, marketing, economic conditions and
competition. Time series are commonly described by additive or
multiplicative models. In an additive model the components of the series (for example trend, seasonal
and irregular components) are added together to give the observed value. In a multiplicative model
the components are multiplied instead. It is difficult to determine whether the
time series follows an additive or multiplicative model without further information.
3)
Does the time series' properties seem reasonable considering what the series describes?
The time series' properties seem logical, since many people may want to buy a new car before the
summer, for instance to be able to make summer trips. Therefore most cars are sold around
May and June. It is also reasonable that people do not buy a new car at the
beginning of the year, since many people have a lot of expenses during December and January.
As already mentioned, other factors surely affect the number of units sold as well.
4)
Seasonally adjusted monthly data
We chose to seasonally adjust the monthly data. When seasonally adjusting data, the goal is to
remove the seasonal component of the series. The reason we do this is to make the data
more representative of the underlying economic conditions and trends. This makes it easier to
compare data from different seasons, and also to compare data from the same season
in different years. What we can see in our time series chart is that, as said before, there is a clear
seasonal component where car sales go up a lot from late spring to the beginning of summer
and then go down in the middle of summer. We can also see a slight upturn in car sales in the
autumn. The seasonally adjusted data does not have as extreme increases and decreases in car sales
as the original data, which implies that the season has a lot to do with how many cars are
sold in Quebec. The increases and decreases of the red line (seasonally adjusted series) show us an
estimate of what the car sales would have been if the seasons did not have any effect. We can
therefore assume that the remaining fluctuations in the red line depend on economic conditions and
trends at the time.
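To illustrate what seasonal adjustment does, the sketch below applies a classical additive decomposition: a centred 12-month moving average estimates the trend, the average deviation from the trend per calendar month estimates the seasonal component, and that component is subtracted from the series. This is an assumption about the method, since the report does not state which procedure produced the red line. The series in the example is hypothetical (a straight-line trend plus a fixed monthly pattern), not the Quebec data.

```python
def seasonally_adjust_additive(y, period=12):
    """Classical additive seasonal adjustment (a sketch for an even period)."""
    half = period // 2
    n = len(y)

    # Centred moving average: half weights on the two end points.
    trend = [None] * n
    for t in range(half, n - half):
        inner = sum(y[t - half + 1:t + half])
        trend[t] = (0.5 * y[t - half] + inner + 0.5 * y[t + half]) / period

    # Average deviation from the trend for each position within the period.
    sums = [0.0] * period
    counts = [0] * period
    for t in range(n):
        if trend[t] is not None:
            sums[t % period] += y[t] - trend[t]
            counts[t % period] += 1
    seasonal = [sums[m] / counts[m] for m in range(period)]

    # Normalise so that the seasonal effects sum to zero over one period.
    mean_effect = sum(seasonal) / period
    seasonal = [s - mean_effect for s in seasonal]

    adjusted = [y[t] - seasonal[t % period] for t in range(n)]
    return adjusted, seasonal


# Hypothetical 48-month series: linear trend plus a monthly pattern summing to zero.
pattern = [-3, -2, 0, 4, 6, 3, -4, -3, -1, 1, 0, -1]
trend_line = [100 + 0.5 * t for t in range(48)]
series = [trend_line[t] + pattern[t % 12] for t in range(48)]

adjusted, seasonal = seasonally_adjust_additive(series)
print([round(s, 2) for s in seasonal])
```

On this artificial series the estimated seasonal effects equal the monthly pattern and the adjusted series falls back on the straight line. With real data the adjusted series still contains the trend and the irregular variation, which is what the red line shows.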