FINAL EXAMINATION 2021-22
87497 – Statistics Applied to Insurance and Risk Management
1 February 2022

INSTRUCTIONS

PLEASE READ THE EXAMINATION PAPER IN FULL BEFORE ATTEMPTING TO ANSWER A SPECIFIC QUESTION.

THIS EXAMINATION CONTAINS MULTIPLE-CHOICE AND SHORT-ANSWER QUESTIONS. IN THE MULTIPLE-CHOICE SECTION, PLEASE EITHER CIRCLE THE ANSWER THAT IS MOST CORRECT (IF COMPLETING BY HAND) OR HIGHLIGHT IT IN YELLOW IF RESPONDING ELECTRONICALLY. IN THE SHORT-ANSWER SECTION, PLEASE WRITE YOUR ANSWER IN THE SPACE PROVIDED.

YOU SHOULD ANSWER ALL PARTS AND ALL QUESTIONS. THERE ARE SEVENTEEN (17) QUESTIONS IN TOTAL. THE MARK PER QUESTION IS WRITTEN NEXT TO THE QUESTION NUMBER. FOR QUESTIONS WITH SUB-SECTIONS (a, b, c, etc.), THE SUB-SECTIONS ARE EQUALLY WEIGHTED.

YOU HAVE TWO (2) HOURS TO COMPLETE THE EXAMINATION, PLUS TEN (10) MINUTES’ PERUSAL TIME AND TEN (10) MINUTES’ ADDITIONAL TIME TO EMAIL YOUR RESPONSE. RESPONSES SHOULD BE EMAILED TO luke.connelly@unibo.it. LATE SUBMISSIONS WILL ATTRACT A PENALTY OF 10 MARKS PER FIFTEEN (15) MINUTES OR PART THEREOF.

IT IS SUGGESTED THAT YOU USE THE PERUSAL TIME TO READ THE WHOLE PAPER CAREFULLY AND MAKE NOTES.

[THIS SECTION IS INTENTIONALLY BLANK]

QUESTION 1 (2 MARKS)
Consider the following simple regression model $y = \beta_0 + \beta_1 x_1 + u$. Suppose z is an instrument for x. Which of the following conditions denotes instrument relevance?
a. Cov(z,u) > 0
b. Cov(z,u) < 0
c. Cov(z,x) ≠ 0
d. Cov(z,x z) = 0

Explanation: Instrument relevance is indicated by the correlation between the instrument (z) and the endogenous variable (x). A non-zero correlation is necessary for the instrument to be considered relevant for addressing endogeneity in the regression model.

QUESTION 2 (2 MARKS)
Consider the following simple regression model $y = \beta_0 + \beta_1 x_1 + u$. The variable z is a poor instrument for x if _____.
a. there is a high correlation between z and x
b. there is a low correlation between z and x
c. there is a high correlation between z and u
d. there is a low correlation between z and u

Explanation: A poor instrument for x is characterized by a low correlation between the instrument (z) and the endogenous variable (x). A weak correlation undermines the ability of the instrument to effectively address endogeneity in the regression model.

QUESTION 3 (2 MARKS)
Which of the following correctly identifies a difference between cross-sectional data and time series data?
a. Cross-sectional data is based on temporal ordering, whereas time series data is not.
b. Time series data is based on temporal ordering, whereas cross-sectional data is not.
c. Cross-sectional data consists of only qualitative variables, whereas time series data consists of only quantitative variables.
d. Time series data consists of only qualitative variables, whereas cross-sectional data does not include qualitative variables.

Explanation: Time series data focuses on the same variable over a period of time, while cross-sectional data focuses on several variables at the same point in time. This difference is the key distinction between time series and cross-sectional data.

QUESTION 4 (2 MARKS)
A static time-series model is postulated when:
a. a change in the independent variable at time ‘t’ is believed to have an effect on the dependent variable at period ‘t + 1’.
b. a change in the independent variable at time ‘t’ is believed to have an effect on the dependent variable for all successive time periods.
c. a change in the independent variable at time ‘t’ does not have any effect on the dependent variable.
d. a change in the independent variable at time ‘t’ is believed to have an immediate effect on the dependent variable.

QUESTION 5 (2 MARKS)
The value of R² always _____.
a. lies below 0
b. lies above 1
c. lies between 0 and 1
d. lies between 1 and 1.5

Explanation: The coefficient of determination (R²) always ranges from 0 to 1, representing the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
An R² of 0 indicates that the model explains none of the variability of the response data around its mean, and an R² of 1 indicates that it explains all of that variability.

QUESTION 6 (2 MARKS)
Which of the following types of variables cannot be included in a fixed effects model?
a. Dummy variable
b. Discrete dependent variable
c. Time-varying independent variable
d. Time-constant independent variable

Explanation: A fixed effects model is a statistical model commonly used in econometrics and other fields to account for unobserved, time-invariant individual heterogeneity in panel data. It is designed to control for entity-specific characteristics (of an individual, firm, etc.) that do not change over time. A time-constant independent variable does not vary across time within the same entity, so it is perfectly collinear with that entity's fixed effect and its coefficient cannot be estimated separately.

QUESTION 7 (2 MARKS)
A normal variable is standardized by:
a. subtracting its mean from it and multiplying by its standard deviation.
b. adding its mean to it and multiplying by its standard deviation.
c. subtracting its mean from it and dividing by its standard deviation.
d. adding its mean to it and dividing by its standard deviation.

Explanation: Standardizing a normal variable, also known as calculating a z-score, involves subtracting the mean of the variable from each observation and then dividing by the standard deviation. This transforms the variable to have a mean of 0 and a standard deviation of 1, i.e. a standard normal distribution.

INFORMATION FOR QUESTIONS 8-10

A dataset was generated by drawing a random sample of 239 Italian households.
The dataset contains information on weekly household income (income) in hundreds of dollars and expenditure on food (foodex), in dollars, for each household. The dataset was used to obtain the following estimates of the relationship between income and foodex via OLS:

$$foodex = 8.59 + 12.04\, income \qquad (1)$$

n = 239; R² = 0.46

Although the t-statistics are not shown here, the intercept and coefficients are statistically significant at the one per cent level (p < 0.01).

Table 1 contains the fitted values from the OLS regression of foodex on income for 15 of the households in the sample. The estimate of foodex is given as foodex_hat, and uhat are the residuals obtained from our OLS regression. The first column, obsno, is the observation number.

Table 1 Fitted Values and Residuals for 15 Households

obsno   income    foodex     foodex_hat   uhat
1       $10.01    $132.52    $120.52       $12.00
2       $12.11    $129.80    $145.80      -$16.00
3       $8.32     $108.17    $100.17       $8.00
4       $5.75     $91.23     $69.23        $22.00
5       $6.72     $50.91     $80.91       -$30.00
6       $6.02     $60.48     $72.48       -$12.00
7       $7.21     $101.81    $86.81        $15.00
8       $6.12     $90.68     $73.68        $17.00
9       $7.53     $81.66     $90.66       -$9.00
10      $6.21     $87.27     $74.77        $12.50
11      $6.23     $86.51     $75.01        $11.50
12      $8.20     $107.23    $98.73        $8.50
13      $11.61    $146.30    $139.78       $6.52
14      $10.50    $108.19    $126.42      -$18.23
15      $6.02     $69.95     $72.48       -$2.53

QUESTION 8 (5 MARKS)
When household income is $0, what is the expected foodex? Explain your answer briefly in the space provided. Also provide a brief explanation of why one might be wary of placing too much emphasis on this estimate.

When household income is $0, the expected food expenditure (foodex) is $8.59. This is calculated by substituting $0 for income in the given regression model: foodex = 8.59 + 12.04(0).

QUESTION 9 (5 MARKS)
(a) For how many of the households in Table 1 do the OLS estimates over-predict foodex?
[Write the total number of households and identify each household by listing their observation number, obsno, in your answer.]

From the table, the OLS estimates over-predict foodex for the households with obsno 2, 5, 6, 9, 14, and 15, because for these households foodex_hat is greater than the actual foodex (equivalently, uhat is negative). Therefore, the answer is: for 6 households in Table 1, the OLS estimates over-predict foodex.

(b) For which household is the under-prediction the largest? [Identify the household by writing their observation number, obsno, in your answer.]

QUESTION 10 (7 MARKS)
Now suppose that income is endogenous.

(a) What are the implications of this for our OLS estimates of the marginal propensity to consume food?

When income is endogenous, income is correlated with the error term (u) in the regression equation. This violates the classical linear regression model (CLRM) assumptions, leading to biased and inconsistent OLS estimates. Specifically, in the context of estimating the relationship between food expenditure (foodex) and income, endogeneity of income has the following implications for OLS estimates of the marginal propensity to consume food:
- Bias in coefficient estimates: The OLS estimate of the coefficient on income is biased, leading to an inaccurate assessment of the marginal impact of income on food expenditure.
- Inconsistency: The bias does not vanish as the sample size grows, so a larger sample does not solve the problem.
- Invalid hypothesis testing: Standard hypothesis tests may be invalid, as they assume exogeneity of the independent variable, which is violated when income is endogenous.

(b) Name at least one possible strategy for addressing the endogeneity of a right-hand-side variable in a regression.

One common strategy for addressing the endogeneity of a right-hand-side variable in a regression is to use an instrumental variable (IV) approach.
In this approach:
- Instrumental variable: An instrumental variable is a variable that is correlated with the endogenous variable (income) but uncorrelated with the error term (u), so that it is related to the dependent variable (foodex) only through the endogenous variable.
- Two-stage least squares (2SLS): The 2SLS method involves two stages. In the first stage, the endogenous variable (income) is regressed on the instrumental variable to obtain the predicted (fitted) values. In the second stage, foodex is regressed on these fitted values, which are used in place of the endogenous variable in the main regression equation. This addresses the endogeneity and yields consistent estimates.

INFORMATION FOR QUESTION 11

Returning to the households dataset, consider an equation to explain food expenditure (foodex) in terms of the household’s income (income), number of children (children), and a variable that measures the distance (dist), in kilometres, between the household’s location and the centre of the closest major city:

$$\log(foodex) = \beta_0 + \beta_1 \log(income) + \beta_2\, children + \beta_3\, dist + u \qquad (2)$$

Note that, in this specification, we take the log of foodex as the dependent variable and the log of income as an explanatory variable. (The variable children is in levels, i.e. it is simply a count of the number of children in the household; dist is also in levels, i.e. kilometres.)

QUESTION 11 (5 MARKS)
Comparing Equation (2) to Equation (1), provide a detailed explanation of the effect you expect the addition of children and dist to have on the equation, including how you would test your hypotheses about the effect of those two variables on the dependent variable.
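
As one illustration of the testing part of this question, the joint null hypothesis H0: β2 = β3 = 0 in Equation (2) can be checked with an F test that compares a restricted model (log(income) only) with the unrestricted model. The sketch below is a minimal Python example on simulated data, not the exam's households dataset; the data-generating values and the helper names ols_ssr and joint_f_stat are assumptions made purely for illustration.

```python
import random


def ols_ssr(y, columns):
    """Fit OLS of y on an intercept plus the given columns; return the SSR."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[pivot] = A[pivot], A[p]
        for r in range(k):
            if r != p:
                factor = A[r][p] / A[p][p]
                A[r] = [a - factor * b for a, b in zip(A[r], A[p])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))


def joint_f_stat(ssr_restricted, ssr_unrestricted, q, n, k):
    """F statistic for q exclusion restrictions (k = slopes in the full model)."""
    return ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - k - 1))


# Simulated data (hypothetical values, NOT the exam's households dataset).
random.seed(0)
n = 239
log_income = [random.gauss(2.0, 0.4) for _ in range(n)]
children = [float(random.randint(0, 4)) for _ in range(n)]
dist = [random.uniform(0.0, 50.0) for _ in range(n)]
log_foodex = [3.0 + 0.6 * li + 0.1 * c - 0.01 * d + random.gauss(0.0, 0.3)
              for li, c, d in zip(log_income, children, dist)]

ssr_ur = ols_ssr(log_foodex, [log_income, children, dist])  # unrestricted model
ssr_r = ols_ssr(log_foodex, [log_income])                   # children, dist dropped
f_stat = joint_f_stat(ssr_r, ssr_ur, q=2, n=n, k=3)
print(f"F(2, {n - 4}) = {f_stat:.2f}")
```

The F statistic would be compared with the critical value of the F(2, n − 4) distribution; individual t-tests on β2 and β3 address each variable separately.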
INFORMATION FOR QUESTIONS 12-14

Now suppose we estimate Equation (3) on the same dataset we used to estimate Equation (1) (i.e., the 239 households), and obtain the following results:

$$\log(foodex) = 7.10 + 0.59 \log(income) + 0.12\, children - 2.05\, dist \qquad (3)$$

n = 239; R² = 0.51

The standard errors and t-statistics are not shown, but suppose the intercept and all estimated coefficients are statistically significant at the one per cent level (p < 0.01).

QUESTION 12 (6 MARKS)
Compare the results obtained by estimating Equation (2) and the results obtained by estimating Equation (1).
a) Which model specification do you prefer, and why? Provide a detailed explanation for your preference, based on statistical reasons.
b) How do the two models compare in terms of goodness-of-fit?

Explanation: Equation (3) has a higher R² (0.51) than Equation (1) (0.46). The higher R² in Equation (3) suggests that the inclusion of the variables children and dist contributes to a better fit, explaining more of the variability in the dependent variable. Note, however, that the two R² values are not strictly comparable, because the dependent variable is foodex in Equation (1) and log(foodex) in Equation (3). In summary, based on the higher R² and better goodness-of-fit, Equation (3) is preferred for explaining the relationship between log(foodex), log(income), children, and dist in the given dataset.

QUESTION 13 (7 MARKS)
Write an explanation of the meaning of these results, explaining the relationship between foodex and the income, children and dvurban. Write your explanation in plain English. (NB: you should explain the relationship between the levels of these variables, i.e. do not refer to the effect of an explanatory variable on the logarithm of the dependent variable, but a change in the untransformed dependent variable.)

Income: When a family's income goes up by 1%, its food spending tends to go up by about 0.59%, holding the other variables fixed. Because this figure is below one, food spending rises less than proportionally with income, so a wealthier household is likely to allocate a smaller proportion of its budget to food.
Number of Children: Each additional child in the household is associated with an increase in food expenditure of roughly 12% (more precisely, 100·(exp(0.12) − 1) ≈ 12.7%). So, having more children is associated with a modest bump in the amount spent on food.

Distance to Major City: If a household is located one kilometre farther from a major city, its food expenditure is expected to be about 87% lower, since 100·(exp(−2.05) − 1) ≈ −87%; the usual shortcut of reading the coefficient as a percentage is inaccurate for a coefficient this large. This suggests that households in more remote areas tend to spend less on their food needs.

Overall Relationship: More income generally means more spending on food. Having more children is connected to a slight increase in food spending. Living farther from major cities is associated with spending less on food. In a nutshell, the model helps us understand how changes in income, the number of children, and distance from major cities correspond to changes in food expenditure for households.

QUESTION 14 (6 MARKS)
Suppose we also have a dummy variable called rural which =1 if the household is in a rural area, and =0 otherwise. Suppose we were to add this dummy variable to the specification and that the adjusted R² increases, but neither of the coefficients on dist and rural is statistically significant at the 10% level. Do you prefer this model, or model (3) (which excludes rural), and what would you do, in Stata, in response to this result?

QUESTION 15 (15 MARKS)
Case Study: FICO Eataly World

In 2017, Fabbrica Italiana Contadina (FICO) Eataly World opened within 20 minutes’ drive of central Bologna. FICO, “the largest food-park in the world”, is a complex of approximately 100,000 square metres containing more than 40 restaurants, more than 100 “traditional” shops, and a range of food and wine production exhibits and displays, as well as offering a range of related activities (e.g., food and wine tasting, pasta-making). The FICO development has not been without its critics and, in December 2016, there were street protests against the development.
While the interest groups may have protested for a variety of reasons, one concerned section of the local community was existing food and wine vendors in central Bologna. Their primary concern was that the new development would reduce their sales.

Suppose that a dataset has been generated to test the hypothesis that FICO reduced the sales of the existing food and wine vendors in central Bologna. Specifically, suppose we have a random sample of sales for vendors in five locations: Bologna, Florence, Verona, Riccione, and Trento for two (2) financial years: 2014-2015 (before the FICO development or rumours about it started) and 2017-2018. Assume that the dataset contains the following variables:

• sales = annual sales in real € (€ 2017-2018) for the financial year.
• advert = the annual advertising expenditure in real € (€ 2017-2018) for the financial year.
• Bol = 1 if the vendor is located in central Bologna; = 0 otherwise.
• empl = the number of employees the vendor employed that financial year.

Our hypothetical sample lends itself to a difference-in-differences (DiD) design. We could estimate it in two steps, or we could implement it using the first-difference panel estimator (FDPE) approach. Recall that the advantage of the latter is that it will give us the standard errors and t-statistics we need to conduct hypothesis tests. We can use this approach to estimate the effect of FICO Eataly World on sales to test the null hypothesis of no difference due to the development, against the alternative hypothesis that sales were affected (either increased or decreased in response to the development).

1. Provide a detailed description of how you would specify the FDPE to estimate the effect of FICO Eataly World on the sales of local food and wine vendors in Bologna.
2. Clearly write out the equation you would seek to estimate.
3. Clearly explain what hypothesis tests can be conducted on each of the parameters in your model.
4. 
Comment, in particular, on which parameter will form the DiD estimate of the causal effect of the opening of FICO Eataly World on local vendors’ sales.
5. Finally, comment on possible threats to your identification strategy.

INFORMATION FOR QUESTION 16

The MROZ dataset we used in laboratory sessions can be used to estimate the return to schooling for married women. The following results were obtained by running a regression of the logarithm of wages, lwage, on hours, kidslt6, and years of education (educ).

Table 2A: OLS Results

. regress lwage hours kidslt6 educ

      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(3, 424)       =     19.63
       Model |  27.2303518         3  9.07678392   Prob > F        =    0.0000
    Residual |  196.097089       424  .462493135   R-squared       =    0.1219
-------------+----------------------------------   Adj R-squared   =    0.1157
       Total |  223.327441       427  .523015084   Root MSE        =    .68007

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hours |  -4.40e-06   .0000431    -0.10   0.919     -.000089    .0000802
     kidslt6 |  -.1194906   .0858103    -1.39   0.165    -.2881572     .049176
        educ |    .111202   .0145368     7.65   0.000     .0826289    .1397751
       _cons |  -.1950308   .1970541    -0.99   0.323    -.5823554    .1922937
------------------------------------------------------------------------------

The dataset also contains information on the years of education of respondents’ mothers (motheduc) and fathers (fatheduc). Those two variables could be used as IVs for educ. The next results show the outcome of IV estimation. Note that both the first-stage regression and the IV estimates of the wages model are presented here.

Table 2B: IV Estimates

. ivreg lwage hours kidslt6 (educ=motheduc fatheduc), first

First-stage regressions

      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(4, 423)       =     29.53
       Model |  486.833444         4  121.708361   Prob > F        =    0.0000
    Residual |  1743.36282       423   4.1214251   R-squared       =    0.2183
-------------+----------------------------------   Adj R-squared   =    0.2109
       Total |  2230.19626       427  5.22294206   Root MSE        =    2.0301

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hours |  -.0000842   .0001286    -0.65   0.513    -.0003369    .0001685
     kidslt6 |   .5402097    .254834     2.12   0.035      .039311    1.041108
    motheduc |    .154359   .0357009     4.32   0.000     .0841857    .2245323
    fatheduc |   .1842193   .0335627     5.49   0.000     .1182488    .2501897
       _cons |    9.56807   .3680946    25.99   0.000     8.844548    10.29159
------------------------------------------------------------------------------

Instrumental variables (2SLS) regression

      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(3, 424)       =      0.95
       Model |  19.5870672         3  6.52902241   Prob > F        =    0.4145
    Residual |  203.740374       424  .480519749   R-squared       =    0.0877
-------------+----------------------------------   Adj R-squared   =    0.0813
       Total |  223.327441       427  .523015084   Root MSE        =     .6932

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0521064   .0328511     1.59   0.113    -.0124649    .1166778
       hours |  -.0000121    .000044    -0.28   0.783    -.0000987    .0000745
     kidslt6 |  -.0774927   .0899143    -0.86   0.389    -.2542261    .0992406
       _cons |   .5572256   .4238401     1.31   0.189    -.2758637    1.390315
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   hours kidslt6 motheduc fatheduc

QUESTION 16 (15 MARKS)
Using the information provided above:
1. Provide a detailed explanation of the relationship between education and wages, according to the OLS results.
2. Provide a detailed rationale for adopting an IV approach, instead of estimating the regression via OLS.
3. Explain the rationale of using both motheduc and fatheduc as IVs for educ: do the results suggest these are good IVs?
4. Compare the OLS and IV results, commenting on any important similarities and differences between them.
5. Comment specifically on the statistical significance of the IV results for educ: why do we often see this type of outcome when IVs are used?

QUESTION 17 (16 MARKS)
The data in Figure 1 are from the website Our World in Data. Both plots show life expectancy at birth in 2014, on the vertical (y) axis, for an international cross-section of countries. Figure 1(a) shows per capita health care expenditure (in international dollars), in levels. Figure 1(b) shows per capita health expenditure in logarithms.

[Figure 1(a)]

[Figure 1(b)]

1. Using Figure 1, describe the relationship between per capita health expenditure and life expectancy at birth.
2. 
Write out a regression that includes these two variables and mention at least three other variables that you would like to include in your model, if the data were available.
3. Is there any argument you can think of that would render the per capita health care expenditure endogenous in this model? Explain your answer.
4. Our World in Data actually has panel data available for these countries, annually, from 1991 through 2014. How would you take advantage of these data to improve the model you described for cross-sectional data? Be specific about (a) what type of model you would implement, and (b) the advantages of that model over a simple cross-sectional model of the relationship between life expectancy at birth and per capita health expenditure.

END OF PAPER