Uploaded by Andrea Agostino Ventura

FINAL EXAMINATION 2022 II Call

FINAL EXAMINATION 2021-22
87497 – Statistics Applied to Insurance and Risk Management
1 February 2022
INSTRUCTIONS
PLEASE READ THE EXAMINATION PAPER IN FULL BEFORE ATTEMPTING TO ANSWER A SPECIFIC
QUESTION.
THIS EXAMINATION CONTAINS MULTIPLE-CHOICE AND SHORT-ANSWER QUESTIONS.
IN THE MULTIPLE-CHOICE SECTION, PLEASE EITHER CIRCLE THE ANSWER THAT IS MOST CORRECT (IF
COMPLETING BY HAND) OR HIGHLIGHT IT IN YELLOW IF RESPONDING ELECTRONICALLY.
IN THE SHORT-ANSWER SECTION, PLEASE WRITE YOUR ANSWER IN THE SPACE PROVIDED.
YOU SHOULD ANSWER ALL PARTS AND ALL QUESTIONS. THERE ARE SEVENTEEN (17) QUESTIONS, IN
TOTAL.
THE MARK PER QUESTION IS WRITTEN NEXT TO THE QUESTION NUMBER.
FOR QUESTIONS WITH SUB-SECTIONS (a, b, c, etc.), THE SUB-SECTIONS ARE EQUALLY WEIGHTED.
YOU HAVE TWO (2) HOURS TO COMPLETE THE EXAMINATION, PLUS TEN (10) MINUTES’ PERUSAL
TIME AND TEN (10) MINUTES’ ADDITIONAL TIME TO EMAIL YOUR RESPONSE. EMAILED RESPONSES
SHOULD BE EMAILED TO luke.connelly@unibo.it. LATE SUBMISSIONS WILL ATTRACT A PENALTY OF
10 MARKS PER FIFTEEN (15) MINUTES OR PART THEREOF.
IT IS SUGGESTED THAT YOU USE THE PERUSAL TIME TO READ THE WHOLE PAPER CAREFULLY AND
MAKE NOTES.
[THIS SECTION IS INTENTIONALLY BLANK]
QUESTION 1 (2 MARKS)
Consider the following simple regression model y = β0 + β1x1 + u. Suppose z is an instrument for x.
Which of the following conditions denotes instrument relevance?
a. Cov(z,u) > 0
b. Cov(z,u) < 0
c. Cov(z,x) ≠ 0
d. Cov(z,x z) = 0
Instrument relevance is indicated by the correlation between the instrument (z) and the
endogenous variable (x). A non-zero correlation is necessary for the instrument to be
considered relevant for addressing endogeneity in the regression model.
QUESTION 2 (2 MARKS)
Consider the following simple regression model y = β0 + β1x1 + u. The variable z is a poor
instrument for x if _____.
a. there is a high correlation between z and x
b. there is a low correlation between z and x
c. there is a high correlation between z and u
d. there is a low correlation between z and u
A poor instrument for x is characterized by a low correlation between the instrument (z) and
the endogenous variable (x). A weak correlation undermines the ability of the instrument to
effectively address endogeneity in the regression model.
QUESTION 3 (2 MARKS)
Which of the following correctly identifies a difference between cross-sectional data and time series
data?
a. Cross-sectional data is based on temporal ordering, whereas time series data is not.
b. Time series data is based on temporal ordering, whereas cross-sectional data is not.
c. Cross-sectional data consists of only qualitative variables, whereas time series data
consists of only quantitative variables.
d. Time series data consists of only qualitative variables, whereas cross-sectional data does
not include qualitative variables.
Time series data follow the same variable over a period of time, while cross-sectional data
observe many units at a single point in time. This temporal ordering is the key distinction
between time series and cross-sectional data.
QUESTION 4 (2 MARKS)
A static time-series model is postulated when:
a. a change in the independent variable at time ‘t’ is believed to have an effect on the
dependent variable at period ‘t + 1’.
b. a change in the independent variable at time ‘t’ is believed to have an effect on the
dependent variable for all successive time periods.
c. a change in the independent variable at time ‘t’ does not have any effect on the dependent
variable.
d. a change in the independent variable at time ‘t’ is believed to have an immediate effect on
the dependent variable.
QUESTION 5 (2 MARKS)
The value of R2 always _____.
a. lies below 0
b. lies above 1
c. lies between 0 and 1
d. lies between 1 and 1.5
The coefficient of determination (R2) always ranges from 0 to 1, representing the
proportion of the variance in the dependent variable that is predictable from the
independent variable(s). An R2 of 0 indicates that the model does not explain any
of the variability of the response data around its mean, and an R2 of 1 indicates
that the model explains all the variability of the response data around its mean.
QUESTION 6 (2 MARKS)
Which of the following types of variables cannot be included in a fixed effects model?
a. Dummy variable
b. Discrete dependent variable
c. Time-varying independent variable
d. Time-constant independent variable
In a fixed effects model, time-constant independent variables create a problem because they
do not vary across time within the same entity (individual, firm, etc.). Fixed effects models
are designed to control for individual-specific characteristics that do not change over time.
When a variable remains constant for all observations within a particular entity, it becomes
perfectly collinear with the fixed effects, so its coefficient cannot be estimated.
A fixed effects model is a statistical model commonly used in econometrics and other fields
to account for unobserved, time-invariant individual heterogeneity in panel data.
QUESTION 7 (2 MARKS)
A normal variable is standardized by:
a. subtracting its mean from it and multiplying by its standard deviation.
b. adding its mean to it and multiplying by its standard deviation.
c. subtracting its mean from it and dividing by its standard deviation.
d. adding its mean to it and dividing by its standard deviation.
Standardizing a normal variable, also known as calculating a z-score, involves
subtracting the mean of the variable from each observation and then dividing by
the standard deviation. This process transforms the variable to have a mean of 0
and a standard deviation of 1, creating a standard normal distribution.
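The standardization described above can be sketched in a few lines. This is a minimal illustration only: the sample values are hypothetical and the population standard deviation (NumPy's default) is assumed.

```python
import numpy as np

# Hypothetical sample, chosen only so the arithmetic is easy to follow.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()      # sample mean
sigma = x.std()    # population standard deviation (ddof=0)

# z-score: subtract the mean, then divide by the standard deviation.
z = (x - mu) / sigma
```

After the transformation, `z` has mean 0 and standard deviation 1, which corresponds to option (c).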
[THIS SECTION IS INTENTIONALLY BLANK]
INFORMATION FOR QUESTIONS 8-10…
A dataset was generated by drawing a random sample of 239 Italian households. The dataset
contains information on weekly household income (income) in hundreds of dollars and
expenditure on food (foodex), in dollars, for each household. The dataset was used to obtain
the following estimates of the relationship between income and foodex via OLS:
𝑓𝑜𝑜𝑑𝑒𝑥 = 8.59 + 12.04𝑖𝑛𝑐𝑜𝑚𝑒    (1)
n = 239; R2 = 0.46
Although the t-statistics are not shown here, the intercept and coefficients are statistically
significant at the one per cent level (p < 0.01).
Table 1 contains the fitted values from the OLS regression of foodex on income for 15 of the
households in the sample. The estimate of foodex is given as foodex_hat, and uhat are the
residuals obtained from our OLS regression. The first column, obsno, is the observation
number.
Table 1 Fitted Values and Residuals for 15 Households
obsno   income    foodex     foodex_hat   uhat
1       $10.01    $132.52    $120.52       $12.00
2       $12.11    $129.80    $145.80      -$16.00
3       $ 8.32    $108.17    $100.17       $ 8.00
4       $ 5.75    $ 91.23    $ 69.23       $22.00
5       $ 6.72    $ 50.91    $ 80.91      -$30.00
6       $ 6.02    $ 60.48    $ 72.48      -$12.00
7       $ 7.21    $101.81    $ 86.81       $15.00
8       $ 6.12    $ 90.68    $ 73.68       $17.00
9       $ 7.53    $ 81.66    $ 90.66      -$ 9.00
10      $ 6.21    $ 87.27    $ 74.77       $12.50
11      $ 6.23    $ 86.51    $ 75.01       $11.50
12      $ 8.20    $107.23    $ 98.73       $ 8.50
13      $11.61    $146.30    $139.78       $ 6.52
14      $10.50    $108.19    $126.42      -$18.23
15      $ 6.02    $ 69.95    $ 72.48      -$ 2.53
QUESTION 8 (5 MARKS)
When household income is $0, what is the expected foodex? Explain your answer briefly in
the space provided. Also provide a brief explanation of why one might be wary of placing
too much emphasis on this estimate.
When household income is $0, the expected food expenditure (foodex) is $8.59. This is
calculated by substituting $0 for income in the given regression model: foodex = 8.59 +
12.04(0).
QUESTION 9 (5 MARKS)
(a) For how many of the households in Table 1 do the OLS estimates over-predict foodex?
[Write the total number of households and identify each household by listing their
observation number, obsno, in your answer.]
From the table, the OLS estimates over-predict foodex for the households with obsno 2,
5, 6, 9, 14, and 15, because foodex_hat is greater than the actual foodex (uhat is
negative) for each of them.
Therefore, the answer is: For 6 households in Table 1, the
OLS estimates over-predict foodex.
(b) For which household is the under-prediction the largest? [Identify the household by
writing their observation number, obsno, in your answer.]
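As a cross-check on both parts of this question, the residuals can be recomputed from the foodex and foodex_hat values printed in Table 1. The sketch below assumes the usual convention uhat = foodex - foodex_hat, so a negative residual means the model over-predicts; the variable names are illustrative.

```python
# Values copied from Table 1, in obsno order 1..15.
foodex = [132.52, 129.80, 108.17, 91.23, 50.91, 60.48, 101.81, 90.68,
          81.66, 87.27, 86.51, 107.23, 146.30, 108.19, 69.95]
foodex_hat = [120.52, 145.80, 100.17, 69.23, 80.91, 72.48, 86.81, 73.68,
              90.66, 74.77, 75.01, 98.73, 139.78, 126.42, 72.48]

# Residual = actual minus fitted, keyed by observation number.
uhat = {i + 1: round(y - y_hat, 2)
        for i, (y, y_hat) in enumerate(zip(foodex, foodex_hat))}

# Over-prediction: fitted value above actual value, i.e. negative residual.
over_predicted = [obs for obs, u in uhat.items() if u < 0]

# Largest under-prediction: the largest positive residual.
largest_under = max(uhat, key=uhat.get)

print(over_predicted)   # [2, 5, 6, 9, 14, 15]
print(largest_under)    # 4
```

On these numbers, household 4 has the largest under-prediction, with a residual of $22.00.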
QUESTION 10 (7 MARKS)
Now suppose that income is endogenous.
(a) What are the implications of this for our OLS estimates of the marginal propensity to
consume food?
When income is endogenous, it means that income is correlated with the error term (u) in the
regression equation. This violates the exogeneity (zero conditional mean) assumption of the
classical linear regression model (CLRM), leading to biased and inconsistent OLS estimates.
Specifically, in the context of estimating the relationship between food expenditure (foodex)
and income, endogeneity of income can have the following implications for OLS estimates of the
marginal propensity to consume food:
- Bias in coefficient estimates: The OLS estimate of the coefficient on income may be biased,
leading to inaccurate assessment of the marginal impact of income on food expenditure.
- Inconsistency: The bias does not disappear as the sample size grows, so collecting more
data does not solve the problem.
- Invalid hypothesis testing: Standard hypothesis tests and confidence intervals may be invalid,
as they assume exogeneity of the independent variable, which is violated when income is endogenous.
(b) Name at least one possible strategy for addressing the endogeneity of a right-hand-side
variable in a regression.
One common strategy for addressing the endogeneity of a right-hand-side variable in a
regression is to use an instrumental variable (IV) approach. In this approach:
- Instrumental variable: An instrumental variable is a variable that is correlated with the
endogenous variable (income) but uncorrelated with the error term, so that it affects the
dependent variable (foodex) only through the endogenous variable.
- Two-stage least squares (2SLS): The 2SLS method involves two stages. In the first stage, the
endogenous variable (income) is regressed on the instrumental variable to obtain the predicted
values (fitted values). In the second stage, the fitted values are used as a substitute for the
endogenous variable in the main regression equation (foodex on income). This helps to address
endogeneity and obtain consistent estimates.
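The two stages described above can be sketched on simulated data. Everything in this example is an assumption made for illustration (the data-generating process, the true slope of 0.5, and the helper `ols`); it does not use the household dataset from the exam.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data: all names and parameter values are illustrative assumptions.
z = rng.normal(size=n)                 # instrument: relevant and exogenous
v = rng.normal(size=n)                 # unobserved factor causing endogeneity
x = 1.0 * z + v + rng.normal(size=n)   # endogenous regressor
u = 2.0 * v + rng.normal(size=n)       # error term, correlated with x through v
y = 1.0 + 0.5 * x + u                  # true slope on x is 0.5


def ols(X, y):
    """Least-squares coefficients for a regression of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)

# Naive OLS of y on x: biased because Cov(x, u) != 0.
b_ols = ols(np.column_stack([const, x]), y)

# Stage 1: regress x on the instrument and keep the fitted values.
Z = np.column_stack([const, z])
x_hat = Z @ ols(Z, x)

# Stage 2: regress y on the stage-1 fitted values.
b_2sls = ols(np.column_stack([const, x_hat]), y)

print(b_ols[1], b_2sls[1])
```

In this setup the OLS slope should land well above 0.5, while the 2SLS slope should be close to it. Note that running the second stage by hand gives the right coefficients but not the right standard errors, which is one reason to use a dedicated IV routine in practice.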
INFORMATION FOR QUESTION 11
Returning to the households dataset… consider an equation to explain food expenditure
(foodex) in terms of the household’s income (income), number of children (children), and a
variable that measures the distance (dist), in kilometres, between the household’s location
and the centre of the closest major city:
𝑙𝑜𝑔(𝑓𝑜𝑜𝑑𝑒𝑥) = 𝛽0 + 𝛽1 𝑙𝑜𝑔(𝑖𝑛𝑐𝑜𝑚𝑒) + 𝛽2 𝑐ℎ𝑖𝑙𝑑𝑟𝑒𝑛 + 𝛽3 𝑑𝑖𝑠𝑡 + 𝑢    (2)
Note that, in this specification, we take the log of foodex as the dependent variable and the
log of income as an explanatory variable. (The variable children is in levels, i.e. it is simply a
count of the number of children in the household; dist is also in levels, i.e. kilometres.)
QUESTION 11 (5 MARKS)
Comparing Equation (2) to Equation (1), provide a detailed explanation of the effect you
expect the addition of children and dist to have on the equation, including how you would test
your hypotheses about the effect of those two variables on the dependent variable.
INFORMATION FOR QUESTIONS 12-14…
Now suppose we estimate Equation (3) on the same dataset we used to estimate Equation
(1) (i.e., the 239 households), and obtain the following results:
log(𝑓𝑜𝑜𝑑𝑒𝑥) = 7.10 + 0.59 log(𝑖𝑛𝑐𝑜𝑚𝑒) + 0.12(𝑐ℎ𝑖𝑙𝑑𝑟𝑒𝑛) − 2.05(𝑑𝑖𝑠𝑡)    (3)
n = 239; R2 = 0.51
The standard errors and t-statistics are not shown, but suppose the intercept and all
estimated coefficients are statistically significant at the one per cent level (p < 0.01).
QUESTION 12 (6 MARKS)
Compare the results obtained by estimating Equation (2) and the results obtained by
estimating Equation (1).
a) Which model specification do you prefer, and why? Provide a detailed explanation for
your preference, based on statistical reasons.
b) How do the two models compare in terms of goodness-of-fit?
Explanation:
Equation (3) has a higher R² (0.51) compared to Equation (1) (0.46).
The higher R² in Equation (3) suggests that the inclusion of the variables children and dist
contributes to a better fit. However, the two R² values are not directly comparable: the dependent
variable is foodex in Equation (1) but log(foodex) in Equation (3), and adding regressors never
lowers R².
In summary, Equation (3) is preferred for explaining the relationship between foodex, income,
children, and dist in the given dataset, because it controls for two additional, statistically
significant variables; the comparison of R² across the two specifications should be treated with caution.
QUESTION 13 (7 MARKS)
Write an explanation of the meaning of these results, explaining the relationship between
foodex and income, children and dist. Write your explanation in plain English. (NB:
you should explain the relationship between the levels of these variables, i.e. do not refer to
the effect of an explanatory variable on the logarithm of the dependent variable, but a change
in the untransformed dependent variable.)
Income:
Holding the other variables fixed, a 1% increase in household income is associated with about a 0.59% increase in
food expenditure. Because this elasticity is below one, food spending rises less than proportionally with income:
wealthier households spend more on food in absolute terms, but devote a smaller share of their budget to it.
Number of Children:
For each additional child in the household, food expenditure is about 12% higher (roughly 12.7% using the exact
conversion 100·(exp(0.12) − 1)), holding income and distance fixed.
Distance to Major City:
If a household is located one kilometre farther from a major city, it is expected to spend less on food. The
coefficient of −2.05 is too large for the usual "multiply by 100" approximation; the exact conversion,
100·(exp(−2.05) − 1), implies a fall of about 87% per kilometre, which is a very large estimated effect.
Overall Relationship:
More income generally means more spending on food.
Having more children is connected to a slight increase in food spending.
Living farther from major cities is associated with spending less on food.
In a nutshell, the model helps us understand how changes in income, the number of children, and distance from major
cities correspond to changes in food expenditure for households.
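To express the coefficients of Equation (3) as changes in the level of foodex, the log-scale estimates can be converted to percentage changes. The sketch below is one way to do this; the helper function names and the 10% income change are illustrative assumptions, not part of the exam.

```python
import math

# Coefficients from Equation (3).
b_income, b_children, b_dist = 0.59, 0.12, -2.05


def pct_change_level(beta, delta=1.0):
    """Exact % change in y when a level regressor rises by delta (log-level model)."""
    return 100.0 * (math.exp(beta * delta) - 1.0)


def pct_change_loglog(beta, pct_x):
    """Exact % change in y when x rises by pct_x percent (log-log model)."""
    return 100.0 * ((1.0 + pct_x / 100.0) ** beta - 1.0)


income_effect = pct_change_loglog(b_income, 10.0)  # income up by 10%
child_effect = pct_change_level(b_children)        # one additional child
dist_effect = pct_change_level(b_dist)             # one additional kilometre

print(round(income_effect, 1), round(child_effect, 1), round(dist_effect, 1))
```

On these coefficients, a 10% rise in income corresponds to roughly 5.8% more food spending, one more child to about 12.7% more, and one more kilometre to about 87% less.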
QUESTION 14 (6 MARKS)
Suppose we also have a dummy variable called rural which =1 if the household is in a rural
area, and =0 otherwise. Suppose we were to add this dummy variable to the specification and
that the adjusted-R2 increases, but neither the coefficient on dist nor the coefficient on rural
is statistically significant at the 10% level. Do you prefer this model, or model (3) (which
excludes rural), and what would you do, in Stata, in response to this result?
QUESTION 15 (15 MARKS)
Case Study: FICO Eataly World
In 2017, Fabbrica Italiana Contadina (FICO) Eataly World opened within 20 minutes’ drive of
central Bologna. FICO—“the largest food-park in the world”—is a complex of approximately
100,000 square metres containing more than 40 restaurants, more than 100 “traditional”
shops, and a range of food and wine production exhibits and displays, as well as offering a
range of related activities (e.g., food and wine tasting, pasta-making).
The FICO development has not been without its critics and, in December 2016, there were
street protests against the development. While the interest groups may have protested for a
variety of reasons, one concerned section of the local community was existing food and wine
vendors in central Bologna. Their primary concern was that the new development would
reduce their sales.
Suppose that a dataset has been generated to test the hypothesis that FICO reduced the sales
of the existing food and wine vendors in central Bologna.
Specifically, suppose we have a random sample of sales for vendors in five locations:
Bologna, Florence, Verona, Riccione, and Trento for two (2) financial years: 2014-2015
(before the FICO development or rumours about it started) and 2017-2018.
Assume that the dataset contains the following variables:
• sales = annual sales in real € (€ 2017-2018) for the financial year.
• advert = the annual advertising expenditure in real € (€ 2017-2018) for the financial year.
• Bol = 1 if the vendor is located in central Bologna; = 0 otherwise.
• empl = the number of employees the vendor employed that financial year.
Our hypothetical sample lends itself to a difference-in-difference (DID) design. We could
estimate it in two steps, or we could implement it using the first-difference panel estimator
(FDPE) approach. Recall that the advantage of the latter is that it will give us the standard
errors and t-statistics we need to conduct hypothesis tests. We can use this approach to
estimate the effect of FICO Eataly World on sales to test the null hypothesis of no difference
due to the development, against the alternative hypothesis that sales were affected (either
increased or decreased in response to the development).
1. Provide a detailed description of how you would specify the FDPE to estimate the
effect of FICO Eataly World on the sales of local food and wine vendors in Bologna.
2. Clearly write out the equation you would seek to estimate.
3. Clearly explain what hypothesis tests can be conducted on each of the parameters in
your model.
4. Comment, in particular, on which parameter will form the DiD estimate of the causal
effect of the opening of FICO Eataly World on local vendors’ sales.
5. Finally, comment on possible threats to your identification strategy.
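One possible way to set up the first-difference estimator is to regress the change in sales between the two financial years on Bol and the changes in the controls: Δsales = δ0 + δ1·Bol + δ2·Δadvert + δ3·Δempl + Δu, where δ1 plays the role of the DiD estimate. The sketch below illustrates this on simulated data; the sample size, coefficient values and noise are all assumptions, and it is not a model answer.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000                                   # hypothetical number of vendors

# Simulated two-period panel: every value here is an illustrative assumption.
bol = (rng.random(n) < 0.2).astype(float)  # 1 if the vendor is in central Bologna
d_advert = rng.normal(size=n)              # change in advertising between the two years
d_empl = rng.normal(size=n)                # change in number of employees
true_effect = -3.0                         # assumed effect of FICO on Bologna sales

# Change in sales: common time trend + treatment effect + controls + noise.
# Time-constant vendor effects drop out because the equation is in differences.
d_sales = (2.0 + true_effect * bol + 1.5 * d_advert + 0.8 * d_empl
           + rng.normal(size=n))

# OLS on the differenced equation: the coefficient on bol is the DiD estimate.
X = np.column_stack([np.ones(n), bol, d_advert, d_empl])
coef, *_ = np.linalg.lstsq(X, d_sales, rcond=None)

# Conventional OLS standard errors and t-statistics for each parameter.
resid = d_sales - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = coef / se

print(coef[1], se[1], t_stats[1])
```

The t-statistic on `bol` tests the null hypothesis of no effect of the development. The intercept captures the change common to all five locations, which is only a valid counterfactual if Bologna would otherwise have followed the same trend as the other cities.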
INFORMATION FOR QUESTION 16…
The MROZ dataset we used in laboratory sessions can be used to produce the return on schooling
for married women.
The following results were obtained by running a simple regression of the logarithm of wages,
lwage, on years of education (educ).
Table 2A: OLS Results
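. regress lwage hours kidslt6 educ

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(3, 424)     =   19.63
       Model |  27.2303518     3  9.07678392           Prob > F      =  0.0000
    Residual |  196.097089   424  .462493135           R-squared     =  0.1219
-------------+------------------------------           Adj R-squared =  0.1157
       Total |  223.327441   427  .523015084           Root MSE      =  .68007

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hours |  -4.40e-06   .0000431    -0.10   0.919     -.000089    .0000802
     kidslt6 |  -.1194906   .0858103    -1.39   0.165    -.2881572     .049176
        educ |    .111202   .0145368     7.65   0.000     .0826289    .1397751
       _cons |  -.1950308   .1970541    -0.99   0.323    -.5823554    .1922937
------------------------------------------------------------------------------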
The dataset also contains information on the years of education of respondents’ mothers
(motheduc) and fathers (fatheduc). Those two variables could be used as IVs for educ. The next
results show the outcome of IV estimation. Note that both the first-stage regression and the IV
estimates of the wages model are presented here.
Table 2B: IV Estimates
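. ivreg lwage hours kidslt6 (educ=motheduc fatheduc), first

First-stage regressions

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(4, 423)     =   29.53
       Model |  486.833444     4  121.708361           Prob > F      =  0.0000
    Residual |  1743.36282   423   4.1214251           R-squared     =  0.2183
-------------+------------------------------           Adj R-squared =  0.2109
       Total |  2230.19626   427  5.22294206           Root MSE      =  2.0301

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hours |  -.0000842   .0001286    -0.65   0.513    -.0003369    .0001685
     kidslt6 |   .5402097    .254834     2.12   0.035      .039311    1.041108
    motheduc |    .154359   .0357009     4.32   0.000     .0841857    .2245323
    fatheduc |   .1842193   .0335627     5.49   0.000     .1182488    .2501897
       _cons |    9.56807   .3680946    25.99   0.000     8.844548    10.29159
------------------------------------------------------------------------------

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(3, 424)     =    0.95
       Model |  19.5870672     3  6.52902241           Prob > F      =  0.4145
    Residual |  203.740374   424  .480519749           R-squared     =  0.0877
-------------+------------------------------           Adj R-squared =  0.0813
       Total |  223.327441   427  .523015084           Root MSE      =   .6932

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0521064   .0328511     1.59   0.113    -.0124649    .1166778
       hours |  -.0000121    .000044    -0.28   0.783    -.0000987    .0000745
     kidslt6 |  -.0774927   .0899143    -0.86   0.389    -.2542261    .0992406
       _cons |   .5572256   .4238401     1.31   0.189    -.2758637    1.390315
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   hours kidslt6 motheduc fatheduc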
QUESTION 16 (15 MARKS)
Using the information provided above:
1. Provide a detailed explanation of the relationship between education and wages, according
to the OLS results.
2. Provide a detailed rationale for adopting an IV approach, instead of estimating the
regression via OLS.
3. Explain the rationale of using both motheduc and fatheduc as IVs for educ: do the results
suggest these are good IVs?
4. Compare the OLS and IV results, commenting on any important similarities and differences
between them.
5. Comment specifically on the statistical significance of the IV results for educ: why do we
often see this type of outcome when IVs are used?
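When comparing the two sets of results, it can help to put the educ estimates side by side. The short sketch below only re-derives the t-statistics from the coefficients and standard errors reported in Tables 2A and 2B; the variable names are illustrative.

```python
# Estimates for educ copied from Table 2A (OLS) and Table 2B (2SLS).
ols_coef, ols_se = 0.111202, 0.0145368
iv_coef, iv_se = 0.0521064, 0.0328511

# t-statistic = coefficient divided by its standard error.
ols_t = ols_coef / ols_se
iv_t = iv_coef / iv_se

# How much larger the IV standard error is than the OLS one.
se_ratio = iv_se / ols_se

print(round(ols_t, 2), round(iv_t, 2), round(se_ratio, 2))   # 7.65 1.59 2.26
```

The IV point estimate is about half the OLS estimate, while its standard error is more than twice as large, which is why educ is no longer significant at conventional levels in the IV results.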
QUESTION 17 (16 Marks)
The data in Figure 1 are from the website Our World in Data. Both plots show life expectancy at birth
in 2014, on the vertical (y) axis, for an international cross-section of countries. Figure 1(a) shows per
capita health care expenditure (in international dollars), in levels. Figure 1(b) shows per capita health
expenditure in logarithms.
Figure 1(a)
Figure 1(b)
1. Using Figure 1, describe the relationship between per capita health expenditure and life
expectancy at birth.
2. Write out a regression that includes these two variables and mention at least three other
variables that you would like to include in your model, if the data were available.
3. Is there any argument you can think of that would render the per capita health care
expenditure endogenous in this model? Explain your answer.
4. Our World in Data actually has panel data available for these countries, annually, from 1991
through 2014. How would you take advantage of these data to improve the model you
described for cross-sectional data? Be specific about (a) what type of model you would
implement, and (b) the advantages of that model over a simple cross-sectional model of the
relationship between life expectancy at birth and per capita health expenditure.
END OF PAPER