Econometrics HW3 Group23

Econometrics Homework 3
Ashhad Khalid (S4468236)
Slobozeanu Razvan (S3942449)
Question 1
a. Dr Jerry is referring to the endogeneity problem in econometrics. Endogeneity occurs
when the independent variable is correlated with the error term in the regression
equation, making it difficult to establish a causal relationship between the independent
and dependent variables. In this case, the endogeneity problem arises because the
level of Dutch imports from China may be affected by demand, which, in turn, affects the
change in employment. If the endogeneity problem exists in Tom's regression model, it
can affect the validity of his results. As a result, Tom's insignificant results could be due
to the endogeneity problem, rather than the absence of a relationship between the
independent and dependent variables.
b. The use of Chinese import growth in other developed countries as an instrument is most
likely valid for several reasons. First, the correlation between Dutch imports from China
and those of other developed countries is high (as shown in Table 1), indicating that they are
likely affected by similar global economic forces. Second, using Chinese import growth in
other developed countries as an instrument allows for the identification of a causal
relationship between changes in foreign supply and Dutch imports. If changes in
Chinese imports to other countries lead to changes in Dutch imports, then we can
reasonably infer that the trade shock from China is affecting the Netherlands. Third, it is
unlikely that Dutch imports are affecting Chinese imports to other countries, which
strengthens the case for the validity of the instrument. This is because the Netherlands is
a relatively small player in global trade compared to China and other developed
countries. Overall, the use of Chinese import growth in other developed countries as an
instrument appears to be a valid approach to isolate the foreign-supply-driven
component of Dutch import penetration.
c. To implement a Two-Stage Least Squares (2SLS) regression using the average imports
of eight other developed economies from China as an instrument, we take the following
steps. First, specify the model: the dependent variable is the change in employment, the
endogenous explanatory variable is Dutch imports from China, and the exogenous control
variables (e.g., Dutch GDP, Chinese GDP, etc.) enter both stages. The instrument is the
average imports of eight other developed economies from China. In the first stage, regress
the endogenous variable (Dutch imports from China) on the instrument and the exogenous
controls, and use the estimated coefficients to predict Dutch imports from China for each
observation in the data set. In the second stage, regress the dependent variable on these
predicted values and the exogenous controls. The coefficient on the predicted values
estimates the causal effect of Dutch imports from China on employment, because it uses
only the variation in imports that is driven by the instrument. Finally, we check the
strength of the instrument by examining the F-statistic on the instrument in the first-stage
regression (which, by the usual rule of thumb, should be greater than 10).
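The two stages described above can be sketched in a few lines of Python. This is only an illustration on synthetic data, not the homework data set: the variable names (`demand`, `z`, `x`, `y`), the coefficients, and the helper functions `ols` and `two_stage_least_squares` are all assumptions made for the example. Note also that running the second stage by hand gives the right point estimate but not the right standard errors, so in practice a dedicated IV routine would be used.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients for y = X b + e (X must already contain a constant)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def two_stage_least_squares(y, x_endog, z, controls=None):
    """Manual 2SLS with one endogenous regressor and one instrument.

    Returns the second-stage coefficient on the fitted endogenous regressor.
    """
    n = len(y)
    const = np.ones((n, 1))
    exog = const if controls is None else np.column_stack([const, controls])
    # First stage: endogenous regressor on the instrument and exogenous controls.
    first_stage_X = np.column_stack([exog, z])
    x_hat = first_stage_X @ ols(first_stage_X, x_endog)
    # Second stage: outcome on the fitted values and the same exogenous controls.
    second_stage_X = np.column_stack([exog, x_hat])
    return ols(second_stage_X, y)[-1]

# Synthetic data (assumed for illustration): an unobserved demand shock drives
# both the regressor x and the outcome y, which makes x endogenous.
rng = np.random.default_rng(0)
n = 20_000
demand = rng.normal(size=n)                       # unobserved confounder
z = rng.normal(size=n)                            # instrument, unrelated to demand
x = 0.8 * z + demand + rng.normal(size=n)         # endogenous regressor
y = -0.5 * x + 2.0 * demand + rng.normal(size=n)  # true effect of x is -0.5

beta_ols = ols(np.column_stack([np.ones(n), x]), y)[1]
beta_iv = two_stage_least_squares(y, x, z)
print(f"OLS estimate:  {beta_ols:.3f}")
print(f"2SLS estimate: {beta_iv:.3f}")
```

In this setup the OLS estimate is pulled away from the true value by the demand shock, while the 2SLS estimate should land close to -0.5.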
d. The graph shows that the spread of the residuals increases with years of education, and
therefore with the predicted values, which points to heteroskedasticity. Additionally, it is
possible that the multiple regression model suffers from multicollinearity between the
independent variables, specifically between "educ" and "educ*exper". This is because the
model includes an interaction term between "educ" and "exper", which means that the effect
of education on wages may vary depending on years of experience. However, if there is also
a high correlation between "educ" and "educ*exper", then it becomes difficult to disentangle
the separate effects of education and experience on wages, and the estimates for both
variables may become imprecise.
e. Yes, the residuals for a given level of education are much smaller than before. Taking the
logarithm of the dependent variable compresses the largest wage values, so the spread of
the residuals no longer grows as strongly with the predicted values. The transformation
does not by itself change the correlation between "educ" and "educ*exper", because both
regressors are left unchanged. Using logarithms on the dependent variable can have several
potential beneficial consequences:
(i) Non-linear relationships: Taking the logarithm of the dependent variable can help to
capture non-linear relationships between the explanatory variables and the dependent
variable.
(ii) Homoscedasticity: Transforming the dependent variable into logarithmic form can help
to address the issue of heteroscedasticity, which occurs when the variance of the error
term is not constant across different levels of the independent variables.
(iii) Interpretation: Taking the logarithm of the dependent variable makes the results of the
regression easier to interpret, because the coefficients can be read as approximate
percentage changes.
f. d log(wage)/d(educ) = 0.0464 + 0.003081 * exper
(all terms that do not contain educ drop out of the derivative).
Plugging in exper = 8 gives d log(wage)/d(educ) = 0.0464 + 0.003081 * 8 = 0.071. This means
that at 8 years of experience, an extra year of education is associated with an increase in
wages of approximately 7.1%.
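The arithmetic in part f can be checked with a short Python sketch. The function name `educ_marginal_effect` is made up for this example; the two coefficients are the ones quoted in the answer. The last line shows the exact percentage change implied by a log-point change, since reading a log-point change directly as a percentage is only an approximation.

```python
import math

def educ_marginal_effect(exper, b_educ=0.0464, b_interaction=0.003081):
    """Marginal effect of one more year of education on log(wage) at a given experience level."""
    return b_educ + b_interaction * exper

effect_at_8 = educ_marginal_effect(8)
print(f"{effect_at_8:.4f}")  # 0.0710

# Exact percentage change implied by a change of `effect_at_8` in log(wage).
exact_pct = (math.exp(effect_at_8) - 1) * 100
print(f"{exact_pct:.2f}%")
```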
g. Not including intelligence in the regression model can bias the estimated effect of
education on wages because intelligence is likely to be correlated with both education and
wages. If intelligence is omitted from the model, the coefficient for education picks up part
of its effect and is biased, most likely overestimating education's impact on wages. This is
because the observed association between education and wages may be partly or entirely
due to the effect of intelligence on both education and wages. To obtain a more accurate
estimate of the true effect of education on wages, controlling for intelligence is crucial.
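The direction of this bias can be illustrated with a small simulation. Everything in the sketch below is assumed for the example (the variable `ability` standing in for intelligence, the coefficient values, the sample size); it is not based on the homework data. It compares a regression that omits `ability` with one that includes it.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients for y = X b + e (X must already contain a constant)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Synthetic data (assumed): ability raises both education and wages.
rng = np.random.default_rng(1)
n = 10_000
ability = rng.normal(size=n)
educ = 12 + 2 * ability + rng.normal(size=n)
log_wage = 1.0 + 0.05 * educ + 0.10 * ability + rng.normal(scale=0.3, size=n)

ones = np.ones(n)
# Short regression: ability is omitted, so educ absorbs part of its effect.
b_educ_short = ols(np.column_stack([ones, educ]), log_wage)[1]
# Long regression: ability is controlled for.
b_educ_long = ols(np.column_stack([ones, educ, ability]), log_wage)[1]

print(f"educ coefficient without ability: {b_educ_short:.3f}")
print(f"educ coefficient with ability:    {b_educ_long:.3f}")
```

With these assumed values the true education coefficient is 0.05; the short regression should overstate it, while the long regression should recover it closely.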
Question 2
(a)
We used the command "encode" to convert the variables stored in string format (Gender,
Smokingstatus, Sleepduration, Exercisefrequency) into numeric variables. For Gender, we
recoded Male = 1 and Female = 0. We did the same for Smokingstatus: the variable equals 1
if a person smokes and 0 otherwise.
Descriptive Statistics
Variable        Obs    Mean      Std.Dev.   Min   Max
ID              458    229.5     132.357    1     458
Age             458    40.207    13.137     9     69
Sleepeffic~y    458    1.019     4.594      .5    99
Awakenings      438    1.648     1.355      0     4
Caffeineco~n    433    23.388    30.095     0     200
Alcoholcon~n    442    1.249     1.639      0     5
gender_num      458    .507      .501       0     1
Smokingsta~m    458    .358      .48        0     1
Sleepdurat~m    458    5.86      1.552      1     9
Exercisefr~m    458    3.793     1.509      1     9
The table provides descriptive statistics for a dataset of 458 observations and nine
variables (plus the ID). The "Sleepefficiency" variable represents the proportion of time spent
in bed that is spent asleep, so it should range from 0 to 1. In the raw data, however, it ranges
from 0.5 to 99, with a mean of 1.019 and a standard deviation of 4.594. Because the maximum
of 99 points to measurement error, we used the command "drop if Sleepefficiency > 1", after
which Stata deleted the five observations affected. The descriptive statistics for the cleaned
sample are:
Variable        Obs    Mean      Std.Dev.   Min   Max
ID              453    228.057   131.603    1     458
Age             453    40.296    13.16      9     69
Sleepeffic~y    453    .789      .135       .5    .99
Awakenings      433    1.644     1.357      0     4
Caffeineco~n    428    23.598    30.189     0     200
Alcoholcon~n    437    1.243     1.643      0     5
gender_num      453    .506      .501       0     1
Smokingsta~m    453    .355      .479       0     1
Sleepdurat~m    453    5.852     1.549      1     9
Exercisefr~m    453    3.757     1.454      1     7
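The cleaning rule applied to Sleepefficiency (discard any observation whose value exceeds 1) can be sketched in Python as follows. The function name `drop_invalid_efficiency` and the three records are invented for illustration; they are not rows from the dataset.

```python
def drop_invalid_efficiency(rows, upper=1.0):
    """Keep only rows whose Sleepefficiency does not exceed `upper`."""
    return [row for row in rows if row["Sleepefficiency"] <= upper]

# Hypothetical records, invented for illustration (not taken from the dataset).
sample = [
    {"ID": 1, "Sleepefficiency": 0.88},
    {"ID": 2, "Sleepefficiency": 99.0},  # impossible value: a proportion cannot exceed 1
    {"ID": 3, "Sleepefficiency": 0.54},
]

cleaned = drop_invalid_efficiency(sample)
print([row["ID"] for row in cleaned])  # [1, 3]
```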
(b)
Linear regression
Sleepefficiency         Coef.    St.Err.   t-value   p-value   [95% Conf   Interval]   Sig
Exercisefrequency_~m    0.019    0.004     4.95      0.000     0.011       0.026       ***
gender_num              -0.002   0.011     -0.15     0.879     -0.024      0.021
Age                     0.001    0.000     1.25      0.213     0.000       0.001
Smokingstatus_num       -0.080   0.011     -7.05     0.000     -0.102      -0.057      ***
Alcoholconsumption      -0.034   0.003     -10.37    0.000     -0.040      -0.027      ***
Constant                0.769    0.023     33.77     0.000     0.725       0.814       ***

Mean dependent var      0.790      SD dependent var        0.135
R-squared               0.327      Number of obs           437
F-test                  41.876     Prob > F                0.000
Akaike crit. (AIC)      -673.555   Bayesian crit. (BIC)    -649.075
*** p<0.01, ** p<0.05, * p<0.1
The coefficient for "Exercisefrequency" (times/week) in the linear regression output
represents the estimated effect of the predictor variable "Exercisefrequency" on the outcome
variable "Sleepefficiency". Specifically, for each additional time per week an individual exercises,
sleep efficiency is estimated to increase by 0.019 units, holding all other predictor
variables in the model constant. The coefficient is positive and statistically significant (indicated
by the t-value of 4.95 and p-value of 0.000), which suggests that higher exercise frequency is
associated with better sleep efficiency. This means that individuals who exercise more
frequently are likely to have a better quality of sleep compared to those who exercise less
frequently, after controlling for other variables in the model. It is important to note that the
coefficient represents the estimated association between exercise frequency and sleep
efficiency, and does not necessarily imply a causal relationship. There may be other factors that
influence both exercise frequency and sleep efficiency, which were not included in the model,
and could potentially confound the relationship.
The variable "Gender" has a coefficient equal to -0.002, which indicates that males
(gender_num = 1) are estimated to have a sleep efficiency 0.002 units lower than females
(gender_num = 0), holding all other predictor variables in the model constant. However,
the coefficient for "gender" is not statistically significant (indicated by the t-value of -0.15 and
p-value of 0.879). This means that there is not enough evidence to conclude that gender is
associated with sleep efficiency in this model, after controlling for other variables such as
exercise frequency, smoking status, alcohol consumption, and age.
(c)
Linear regression
Sleepefficiency         Coef.    St.Err.   t-value   p-value   [95% Conf   Interval]   Sig
Exercisefrequency_~m    0.020    0.005     4.05      0.000     0.010       0.030       ***
gender_num              0.011    0.031     0.36      0.722     -0.050      0.072
Exercisefreq_Gender     -0.003   0.008     -0.44     0.659     -0.019      0.012
Age                     0.001    0.000     1.25      0.211     0.000       0.001
Smokingstatus_num       -0.079   0.011     -6.91     0.000     -0.101      -0.056      ***
Alcoholconsumption      -0.034   0.003     -10.36    0.000     -0.040      -0.027      ***
Constant                0.764    0.026     29.75     0.000     0.714       0.815       ***

Mean dependent var      0.790      SD dependent var        0.135
R-squared               0.327      Number of obs           437
F-test                  34.864     Prob > F                0.000
Akaike crit. (AIC)      -671.753   Bayesian crit. (BIC)    -643.193
*** p<0.01, ** p<0.05, * p<0.1
In this updated model, a new variable "Exercisefreq_Gender" is included, which is the
interaction between exercise frequency and gender. The coefficient for this variable is -0.003
with a p-value of 0.659, so the interaction term is not statistically significant. There is therefore
no evidence that the relationship between exercise frequency and sleep efficiency differs
between males and females. The coefficient for "Exercisefrequency_~m" is still significant at
0.020 with a p-value of 0.000. Because the model contains the interaction term, this is the effect
for the group with gender_num = 0 (females): each one-unit increase in exercise frequency
raises sleep efficiency by 0.020 units on average, holding all other variables constant. The
coefficient for "gender_num" is 0.011 with a p-value of 0.722, so there is no significant difference
in sleep efficiency between males and females at an exercise frequency of zero, holding all other
variables constant. The coefficients for "Age", "Smokingstatus_num", and "Alcoholconsumption"
are similar to those in the previous model. "Smokingstatus_num" and "Alcoholconsumption"
remain significant at the 0.01 level, while "Age" is not significant (p-value of 0.211).
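How the interaction term changes the exercise slope for each group can be written out in a short Python sketch. The function name `exercise_slope` is made up for this example; the two coefficients are the rounded values from the regression table in part (c), with gender_num coded as Male = 1 and Female = 0.

```python
def exercise_slope(gender_num, b_exercise=0.020, b_interaction=-0.003):
    """Estimated effect of one more unit of exercise frequency for a given gender code."""
    return b_exercise + b_interaction * gender_num

slope_female = exercise_slope(0)  # gender_num = 0
slope_male = exercise_slope(1)    # gender_num = 1
print(f"female: {slope_female:.3f}")  # female: 0.020
print(f"male:   {slope_male:.3f}")    # male:   0.017
```

The gap between the two slopes is exactly the interaction coefficient, which is why an insignificant interaction means the slopes cannot be told apart statistically.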
(d)
Linear regression
Sleepefficiency         Coef.    St.Err.   t-value   p-value   [95% Conf   Interval]   Sig
Exercisefrequency_~m    0.008    0.004     1.89      0.059     0.000       0.017       *
gender_num              0.009    0.026     0.34      0.738     -0.043      0.060
Exercisefreq_Gender     0.000    0.007     0.05      0.958     -0.013      0.013
Age                     0.001    0.000     1.57      0.117     0.000       0.001
Smokingstatus_num       -0.087   0.010     -9.03     0.000     -0.106      -0.068      ***
Alcoholconsumption      -0.027   0.003     -9.45     0.000     -0.032      -0.021      ***
Awakenings              -0.048   0.004     -13.52    0.000     -0.055      -0.041      ***
Constant                0.873    0.023     37.40     0.000     0.827       0.919       ***

Mean dependent var      0.790      SD dependent var        0.135
R-squared               0.545      Number of obs           417
F-test                  70.021     Prob > F                0.000
Akaike crit. (AIC)      -798.291   Bayesian crit. (BIC)    -766.026
*** p<0.01, ** p<0.05, * p<0.1
The coefficient of "alcohol consumption" in the previous model was -0.034, while in the
current model that includes "awakenings," it is -0.027. The difference in these
coefficients is small, but it suggests that the relationship between alcohol consumption
and sleep efficiency may be slightly weaker when accounting for the effect of
awakenings. However, it is important to note that the coefficient is still negative and
statistically significant, indicating that higher levels of alcohol consumption are
associated with lower sleep efficiency, even after controlling for other variables such as
exercise frequency, gender, age, and smoking status. Overall, which model to choose
depends on the research question and the purpose of the analysis. The model that
includes awakenings fits the data better (R-squared of 0.545 compared with 0.327),
although it is estimated on 417 rather than 437 observations. If the goal is to measure
the relationship between alcohol consumption and sleep efficiency holding awakenings
fixed, the current model is more appropriate. However, if awakenings are one of the
channels through which alcohol lowers sleep efficiency, controlling for them removes
part of the total effect, and the previous model without awakenings may be preferable.
(e)
Linear regression
Sleepefficiency         Coef.    St.Err.   t-value   p-value   [95% Conf   Interval]   Sig
Exercisefrequency_~m    0.006    0.004     1.27      0.203     -0.003      0.014
gender_num              0.000    0.026     -0.01     0.992     -0.052      0.051
Exercisefreq_Gender     0.003    0.007     0.48      0.629     -0.010      0.016
Age                     0.006    0.002     2.83      0.005     0.002       0.010       ***
age_squared             0.000    0.000     -2.61     0.009     0.000       0.000       ***
Smokingstatus_num       -0.087   0.010     -9.06     0.000     -0.106      -0.068      ***
Alcoholconsumption      -0.026   0.003     -9.05     0.000     -0.031      -0.020      ***
Awakenings              -0.047   0.004     -13.51    0.000     -0.054      -0.040      ***
Constant                0.777    0.044     17.80     0.000     0.691       0.863       ***

Mean dependent var      0.790      SD dependent var        0.135
R-squared               0.553      Number of obs           417
F-test                  62.986     Prob > F                0.000
Akaike crit. (AIC)      -803.181   Bayesian crit. (BIC)    -766.884
*** p<0.01, ** p<0.05, * p<0.1
The coefficient for the variable age^2 is negative (it rounds to 0.000 in the table, with a
t-value of -2.61), and it is statistically significant at the 0.01 level. This indicates that
there is a non-linear relationship between age and sleep efficiency. Together with the
positive coefficient for age (0.006), it implies that sleep efficiency first increases with
age and then decreases. This type of relationship is called an "inverted U-shaped"
relationship. In this case, it suggests an age at which sleep efficiency is maximized, and
beyond that age, sleep efficiency starts to decline. However, to determine at what age
sleep efficiency is maximized, a further calculation is needed:
The scalar Age_max stores the age at which predicted sleep efficiency is maximised. It is
calculated from the coefficients for age and age squared in the last regression model, as minus
the age coefficient divided by twice the age-squared coefficient. The result of the calculation is
44.9 years, which means that predicted sleep efficiency peaks at this age. Beyond this age,
additional years are associated with lower sleep efficiency.
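The turning point of a quadratic age profile can be sketched in Python as below. The function name `turning_point` is made up for this example. The age coefficient 0.006 is the rounded value from the table; the age-squared coefficient is shown there only as 0.000, so the value -0.0000668 used here is a hypothetical number chosen to be consistent with the reported 44.9 years, not an estimate taken from the output.

```python
def turning_point(b_age, b_age_squared):
    """Age at which b_age * age + b_age_squared * age**2 reaches its extremum."""
    if b_age_squared == 0:
        raise ValueError("No turning point: the age-squared coefficient is zero.")
    return -b_age / (2 * b_age_squared)

b_age = 0.006               # rounded coefficient from the regression table
b_age_squared = -0.0000668  # hypothetical: the table only shows 0.000 after rounding

age_max = turning_point(b_age, b_age_squared)
print(f"{age_max:.1f}")  # 44.9

# A negative age-squared coefficient means the extremum is a maximum.
is_maximum = b_age_squared < 0
```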