EMET2008 Course Material
Release 1.0
Juergen Meinecke
October 23, 2014

Contents
1 Announcements
2 Slides (weeks 8 through 13)
3 Course Material
  3.1 Course Outline
  3.2 Reading List
  3.3 Exercises
  3.4 Assignments
  3.5 Illustration of Central Limit Theorem using Monte Carlo Simulation

This website hosts important content for the ANU course EMET2008.

CHAPTER 1 Announcements

Note:
• Final exam: Monday, 10 November from 9:25am to 11:00am, location: HA GO51
• Practice final now online: Practice Final
• Special troubleshooting and exam solving tute: Thursday, 6 November at 10:00am, location: Arndt TR3 (If you can’t make it, please come see me during consultation times instead.)
• No tute during week 13
• Consultation times:
  – Tue 1:00pm to 4:00pm (week 12, 13, and 14)
  – Thu 1:00pm to 4:00pm (week 14)
• Assignment 2 deadline: Wednesday 29 October, 2:00pm
• Midterm: Midterm
• Midterm answer key: Midterm answer key

CHAPTER 2 Slides (weeks 8 through 13)

I am switching to a more conventional presentation style, using slides (instead of writing on the white board). I hope most of you prefer this.
• Weeks 8 and 9: Slides
• Weeks 10 and 11: Slides
• Weeks 12 and 13: Slides

CHAPTER 3 Course Material

3.1 Course Outline

Read this entire course outline carefully!
Any items, rules, or requirements in this course outline may be subject to change. When this happens I will announce it during the lecture. Announcements in the lecture supersede any information contained in this course outline.

3.1.1 Course Description

This course presents and develops techniques necessary for the quantitative analysis of economic and business problems that are beyond the scope of the simple linear regression model covered in EMET2007 or STAT2008. Topics include: endogeneity, natural experiments, binary dependent variables, time series regressions and panel data estimation. This is a hands-on course with a focus on applications in economics as well as business. Standard statistical software will be used during computer sessions; no special programming skills are required.

Learning Outcome

Upon successful completion of the requirements for this course, students will
• understand the challenges of empirical modelling in economics and business
• understand the shortcomings of the standard linear regression model
• be able to apply important extensions to the linear regression model
• be able to express new econometric methods mathematically
• be able to think clearly about the relationship between data, model and estimation in econometrics
• use statistical software to study actual data sets

Topics Covered

I intend to teach the following set of topics.
• Brief review of OLS estimation (2 weeks)
• Endogeneity: when OLS fails (1 week)
• Instrumental variables estimation (2 weeks)
• Experiments and quasi-experiments (2 weeks)
• Binary dependent variables (2 weeks)
• Panel data and time series models (4 weeks)

If you are interested in any topic not listed here, feel free to let me know; I am happy to adapt the course and incorporate your ideas and preferences. Note that the numbers of weeks given in parentheses are only estimates and may differ as we go along.
Prerequisites

To enrol in this course you must have completed
• ECON1101 and
• EMET2007 or STAT2008.

Communication

Important: The official website for this course is <http://EMET2008.Readthedocs.org>

I will frequently make announcements on the homepage of the Course Website (under “Announcements”). The official forum for announcements of any kind is the lectures. If necessary, I will contact students electronically using their official ANU student e-mail address. If you want to contact me, send an e-mail to juergen.meinecke@anu.edu.au

E-mail addresses are only to be used when you need to contact staff about administrative or academic matters. They are NOT to be used for instructional purposes.

Textbook

The textbook for the course is Introduction to Econometrics (third edition, 2012) by Stock and Watson. Chifley Library has several copies of the textbook. I strongly recommend that you buy a copy of the book as I base the lecture and practice sessions on it. Other excellent textbooks include A Guide to Modern Econometrics by Verbeek and Introductory Econometrics: A Modern Approach (5th edition) by Wooldridge.

Software

The econometric software for this course is “Stata”. Here’s a quick wiki summary of what Stata is: <http://en.wikipedia.org/wiki/Stata>. From my own experience, Stata is an exhaustive, well-documented, powerful and user-friendly statistical software package. We will get to know Stata during the tutorial in a “learning-by-doing” manner.

Staff

Administrative

For any administrative inquiries or problems (e.g., tutorial enrollment, exam scheduling, supplementary exams, etc.) you should contact Terry Embling (School of Economics Course Administrator) or Finola Wijnberg (School of Economics School Administrator).
Name             Job title             Office Location                    Hours       E-mail
Terry Embling    Course administrator  HW Arndt Building 25a, Room 1013   9:00-16:00  terry.embling@anu.edu.au
Finola Wijnberg  School administrator  HW Arndt Building 25a, Room 1014   9:00-16:00  finola.wijnberg@anu.edu.au

Academic

If you have any academic inquiries or problems regarding the course, please don’t hesitate to contact me:

Name              Office Location                    Hours            E-mail
Juergen Meinecke  HW Arndt Building 25a, Room 1022   Tue 13:00-16:00  juergen.meinecke@anu.edu.au

Lectures and Tutorials

There will be four hours of contact time per week: a two-hour lecture and a two-hour practice session. You are expected to attend all of these. If you have persistent time conflicts with any of these class sessions, you should not be taking the course. Although content will be made available digitally (for example through audio recordings) you should not treat virtual attendance as a perfect substitute for physical attendance.

The class meets in the following venues at the following times:

Day       Type               Time   Location
Tuesday   Lecture            10-12  CBE Bld LT4
Thursday  Problem Solving    10-11  COP GO25
Thursday  Computing Session  11-12  COP GO25

As you can see, the two-hour practice sessions happen on Thursdays and can be subdivided into a one-hour problem solving session and a one-hour computing session. We will not always treat these two sub-sessions as strictly separate and instead regard the two as one big practice session that combines both theoretical exercises with computing exercises.

Digital Lecture Delivery

Audio recordings of the Tuesday lecture will be made available on Wattle. The Thursday sessions (tute and computing) will not be made available on Wattle (they are group learning sessions and as such do not lend themselves to audio recordings).

Workload

University study requires at least as much time and effort as a full-time job. You are expected to attend all lectures and tutorials (4 hours per week).
You should expect to put in at least 6 hours per week of your own study time for this course in addition to the 4 hours of lectures and tutorials.

3.1.2 Course Assessment

The following table summarizes the assessable items for the course.

Assessment Item                  Due date           Weight
Assignment 1                     Thursday, week 6   10%
Midterm exam                     Week 7             25%
Assignment 2                     Friday, week 13    10%
Final exam                       TBA                45%
Practice session participation   Throughout         10%

Note: all assessment items are compulsory. If you miss any one item without approval by the School or College, you will fail the entire course!

Assignments

Working through exercises is an effective method of learning econometrics, as it is with most mathematical subjects. That means that the assignments are more than simply part of the assessment for the course. Students will be required to submit two written assignments during the semester. The assignments will require computer work as well as analytical work.

These assignments should be your own work. You may discuss assignments with classmates, but you should do all your own computing and writing of the assignments. It is an offense against the University’s regulations to copy from other students’ assignments.

Assignments should be submitted by dropping them into a specially labeled assignment box at the Research School of Economics. (Contact Terry Embling for details.) The front page of the submitted assignments must show your name, student number and the course name (EMET2008). Assignments missing any of this information will receive a mark of zero.

Assignments must be submitted by 2pm on the due date. If you have a university-approved excuse for not handing in an assignment, then the value of the final exam will be increased by 10 percentage points to compensate for the missed work. Further details about assignment submission will be given during lectures.

Midterm Examination

The midterm examination will be held during practice session time on Thursday of week 7.
The exam covers all material from weeks 1 through 6 of the course. The exam will be marked out of 100. It is your responsibility to make yourself available for the midterm examination. No make-up midterm examination will be offered. Should you miss the midterm exam for a valid reason (see Rules and Policies below) then your grade will be based solely on your final exam.

Final Examination

Examinable material covers the whole semester, including material already covered in the midterm exam. The exam will be marked out of 100. The final exam will be held in the exam period at the end of the semester. Details will be posted on the ANU exam timetable site.

Practice Session Participation

Your participation is an essential part of the overall learning experience in the course (both for you and for your classmates!). I will evaluate you on your participation during the Thursday practice sessions. Feel free to participate and contribute to the sessions. Do not be afraid to give wrong answers; as long as you are constructively engaged, there is no such thing as a wrong answer.

Every Thursday after the practice sessions I will take note of students who participated in class, and at the end of the semester I will aggregate these notes into an overall participation mark. Roughly, I will give 10 marks to regular participants, 5 marks to occasional participants and zero marks to students who rarely or never participate. Feel free to seek feedback from me during the semester on your participation performance.

Scaling of Grades

Final scores for the course will be determined by scaling the raw score totals to fit a sensible distribution of grades. Scaling can increase or decrease a mark but does not change the order of marks relative to the other students in the course.
If it is decided that scaling is appropriate, then the final mark awarded in a course may differ from the aggregation of the raw marks of each assessment component.

3.1.3 Rules and Policies

It is your responsibility to familiarize yourself with the rules and regulations and the policies and procedures that are relevant to your studies at the ANU. ANU has educational policies, procedures and guidelines, which are designed to ensure that staff and students are aware of the University’s academic standards, and implement them. You can find the University’s education policies and an explanatory glossary at: ANU Policies.

Students are expected to have read the Student Academic Integrity Policy before the commencement of their course. Other key policies include:
• Student Assessment (Coursework)
• Student Surveys and Evaluations

The University also offers a number of support services for students. Information on these is available online from ANU Studentlife.

3.2 Reading List

Note: This reading list assists you in finding the textbook sections that cover the material discussed during that week’s lecture and practice session. Occasionally, the references provided in the table below go beyond what was discussed in class. In those cases, the reading is only recommended for deepening your understanding; it is, however, not required.

Week                   Textbook sections          Key concepts covered
0 (assumed knowledge)  chapters 4 through 7       linear regression with one regressor; linear regression with multiple regressors; hypothesis tests and confidence intervals
1                      2.5, 2.6, 3.1              population; random sample; iid; statistical inference; random variable; population mean vs. sample average; estimator vs. estimate; unbiasedness; consistency, law of large numbers, convergence in probability; central limit theorem
2                      3.1, 3.2, 3.3              unbiasedness; consistency; efficiency; BLUE; central limit theorem; Monte Carlo simulation; hypothesis testing; confidence interval; standard error
3                      4.1, 4.2, 4.4, 6.5, 6.7    brief review of OLS
4                      1.2, 6.1, 9.2, 9.4         endogeneity; causal effect; omitted variables bias; sample selection bias; simultaneity bias, reverse causality; measurement error bias
5                      12.1 through 12.5          instrumental variables estimation; instrument relevance; instrument exogeneity; TSLS estimation; first stage; reduced form equation; structural equation; weak instrument; rule of thumb
6                      13.1, 13.2, 13.3           experiments; randomized control trial; causal effect; treatment effect; internal and external validity; threats to validity

3.3 Exercises

Note: Answers to exercises will only be provided during class time. If you cannot make it to class, you will need to see me during consultation times and we will work through the exercises together. (When you see me during consultation times, I expect you to be prepared. I will never merely provide answers to exercises. Instead, I want to see good faith effort on your part, in which case I will be more than happy to help you work through the exercises.)

3.3.1 Week 1

Problem Solving

1. Prove that the sample average $\bar{Y}$ is an unbiased estimator for the population mean.
2. Prove that in the linear model $Y_i = \mu + \varepsilon_i$ the ordinary least squares estimator of $\mu$ is equal to the sample average. Mathematically, minimize the sum of squares $\sum_{i=1}^{n} (Y_i - \hat{\mu})^2$ and show that the minimum is obtained by setting $\hat{\mu} = \bar{Y}$.
3. Is there a difference between an estimator and an estimate?

Computational

We will use this first practice session to become familiar with Stata.

1. Work through the “Stata for Researchers” website; you can find the link below. I will use this exercise to teach you some basic tricks you should know about Stata.
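A first Stata session is also a good moment to see the claim in Problem Solving exercise 2 at work numerically. The following is only a minimal sketch and not part of the original exercise set: it assumes the auto example data set that ships with Stata (loaded with sysuse), uses price purely as an illustrative variable, and the scalar name ybar is chosen here for illustration. It does not replace the proof asked for in the exercise.

```stata
// Illustration only: a constant-only regression reproduces the sample average
sysuse auto, clear                        // example data set shipped with Stata (assumption)
summarize price                           // reports the sample average of price
scalar ybar = r(mean)                     // store the sample average (hypothetical scalar name)
regress price                             // OLS with an intercept and no regressors
display "sample average: " ybar
display "OLS intercept:  " _b[_cons]      // should coincide with the sample average
```

Running the lines one by one in the do-file editor also gives some practice with the basic workflow: load data, run a command, then read stored results such as r(mean) and _b[_cons].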
For Stata help and support I can highly recommend the following two sources:

• The Social Science Computing Cooperative at the University of Wisconsin provides excellent support for Stata beginners. Check out their website “Stata for Students”: <http://www.ssc.wisc.edu/sscc/pubs/stata_students1.htm> Also, check out their website “Stata for Researchers”: <http://www.ssc.wisc.edu/sscc/pubs/sfr-intro.htm> This website should be your first port of call in all things Stata.
• Furthermore, the UCLA Institute for Digital Research and Education provides fantastic resources for people who are interested in learning Stata: <http://www.ats.ucla.edu/stat/stata/>

Feel free to use these links throughout the semester to improve your Stata skills!

3.3.2 Week 2

Problem Solving

1. Let $Y_i \sim \text{i.i.d.}(\mu, \sigma^2)$. You have learnt in the lecture that $\hat{\mu}_1 := \bar{Y}_n$ is an unbiased and consistent estimator for the population mean $\mu$. Are the following estimators also unbiased or consistent for $\mu$? Discuss!
   (a) $\hat{\mu}_2 := 42$ (‘the answer to everything’ estimator)
   (b) $\hat{\mu}_3 := \bar{Y}_n + 3/n$
   (c) $\hat{\mu}_4 := (Y_1 + Y_2 + Y_3 + Y_4 + Y_5)/5$
2. Excerpt from the website of the Australian Bureau of Statistics:

   "The Adult Literacy and Life Skills Survey (ALLS) was conducted in Australia as part of an international study coordinated by Statistics Canada and the Organisation for Economic Co-operation and Development (OECD). The ALLS is designed to identify and measure literacy which can be linked to the social and economic characteristics of people both across and within countries. The ALLS measured the literacy of a sample of people aged 15 to 74 years. The ALLS provides information on knowledge and skills in (among others) Numeracy, i.e. the knowledge and skills required to effectively manage and respond to the mathematical demands of diverse situations."
   In a sample of 1,000 randomly selected Australians, the average numeracy score was 312 and the sample standard deviation was 41. Construct a 95% confidence interval for the population mean of the numeracy score.
3. Exercise 3.3 parts a and b; Exercise 3.4.

Computational

1. Continue working through the “Stata for Researchers” website (as started last week).
2. Empirical Exercise E4.2 parts a and b.

3.3.3 Week 3

Problem Solving

Consider the following linear model for heights: $Y_i = \beta_0 + \beta_1 X_{i1} + u_i$, where $Y_i$ is the height of person $i$ and $X_{i1}$ is a gender dummy variable that takes on the value 1 if person $i$ is male and zero otherwise.

1. In that model, what does $\beta_0$ capture? What does $\beta_0 + \beta_1$ capture?
2. Define and derive (mathematically) the OLS estimators of $\beta_0$ and $\beta_1$.

Computational

1. Empirical Exercise E3.1 part d.
2. Empirical Exercise E4.4. The following Stata do-file is a solution to this exercise. Please feel free to use this as the starting point for all your future do-files. Just copy and paste it into your Stata do-file editor and save it as a new do-file. Since it contains the answer to Empirical Exercise 4.4, I gave it the name “E4_4.do”, but you can choose whatever name you want. What’s important here is that you need to customize the code below in one place:

• Work directory: This is the location on your computer where you access and store all your files. This includes the textbook’s data files (the files with a dta-suffix), your self-written Stata do-files as well as the log-files that are created by your do-files. You choose the work directory; it is likely different on different computers. For example, on my office desktop computer I created a work directory called /Users/juergen/EMET2008/Stata. For the code below to work, you need to keep ALL files in the same work directory! Again, this includes your textbook’s data files as well as the Stata do-files and log-files that you create:

    // ====================================================
    // PREAMBLE
    // ====================================================
    clear all            // clear memory
    capture log close    // close any open log files
    set more off         // don't pause when screen fills

    // set work directory (put your own path here!):
    cd /path/to/location/on/your/computer/where/Stata/files/go
    log using E4_4.log, replace    // open new log-file

    // ====================================================
    // Work on your data set
    // ====================================================
    use "Growth.dta"     // loading data set (needs to be in work directory)
    summarize
    scatter growth tradeshare
    regress growth tradeshare
    margins, at(tradeshare==1)
    margins, at(tradeshare==0.5)
    regress growth tradeshare if country_name!="Malta"
    margins, at(tradeshare==1)
    margins, at(tradeshare==0.5)

    log close            // close log-file

3.3.4 Week 4

Problem Solving

1. Derive the bias from omitted variables.
2. In a recent applied econometrics research project, I have been interested in the causal effect of academic fraud on labor market outcomes. The broad research question is: Do people who commit academic fraud (at university) benefit significantly from it? Sounds like a straightforward research question, but answering it is quite challenging econometrically. Let’s say the model looks like
   $Y_i = \beta_0 + \beta_1 \text{Fraud}_i + \beta_2 \text{Male}_i + \beta_3 \text{Educ}_i + \beta_4 \text{Age}_i + u_i$,
   where $Y_i$ are weekly earnings (full time) and $\text{Fraud}_i$ is a dummy variable that is equal to one if a person reported that s/he committed academic fraud during university and zero otherwise. (All other right-hand-side variables are self-explanatory.) If I run this regression and obtain the estimate $\hat{\beta}_1$ for $\beta_1$, can I interpret this as the causal effect of academic fraud on earnings? Discuss!

Computational

1. Empirical Exercise E6.3.

Solution:

    // ====================================================
    // PREAMBLE
    // ====================================================
    clear all            // clear memory
    capture log close    // close any open log files
    set more off         // don't pause when screen fills

    // set work directory (put your own path here!):
    cd /path/to/location/on/your/computer/where/Stata/files/go
    log using E6_3.log, replace    // open new log-file

    // ====================================================
    // Work on your data set
    // ====================================================
    use "./Stock_data/Growth.dta"
    drop if (country_name=="Malta")
    summarize
    reg growth tradeshare yearsschool rev_coups assasinations rgdp60
    margins, atmeans
    margins, at((mean) _all tradeshare=.771)

    // checking for heteroskedasticity
    // 1) pedestrian way
    predict uhat, res
    generate uhatsq = uhat^2
    regress uhatsq tradeshare yearsschool rev_coups assasinations rgdp60
    // null: homoskedasticity
    // check F-stat

    // 2) lazy way
    reg growth tradeshare yearsschool rev_coups assasinations rgdp60
    estat hettest, rhs fstat

    log close            // close log-file

2. In EMET2007 you (hopefully!) have learned how to test for homoskedasticity versus heteroskedasticity. How would you do this with Stata? (Use the Growth data set from the previous exercise to illustrate the test.) If you indeed find that the data is heteroskedastic, how would you correct for it with Stata?

3.3.5 Week 5

Problem Solving

Consider the simple linear model $Y_i = \beta_0 + \beta_1 X_i + u_i$.

1. Mathematically define the OLS estimator and prove that it is inconsistent under endogeneity.
2. Mathematically define the TSLS estimator and prove that it is consistent under endogeneity.
3. Which of the two estimators is consistent under exogeneity?
4. Research question: Do girls who attend girls’ schools do better in math than girls who attend coed schools?
   I give you a data set that includes the following variables:
   • score: score in a standardized math test
   • girlshs: dummy variable which is equal to 1 if a person attended a girls’ school and zero otherwise
   • feduc: father’s education
   • meduc: mother’s education
   • hhinc: household income
   (a) You run an OLS estimation of score on girlshs and all the other variables. Will your OLS estimate of the coefficient on girlshs capture the causal effect of girls’ school on math score? If not, why not?
   (b) What would be a good instrumental variable for girlshs?
   Note: this exercise is based on Wooldridge, Introductory Econometrics: A Modern Approach, 5th edition, chapter 15.

Computational

1. Empirical Exercise E12.2 (Stock and Watson book)

3.3.6 Week 6

Problem Solving

Cool things can be done with randomized control trials. Here I expose you to the work of two recent economics papers published in a top field journal.

1. We will read and discuss the paper on the effects of home computer use on academic achievement of school children (written by Fairlie and Robinson). Here is the paper for download (with my annotations): Fairlie Robinson 2013
2. We will read and discuss the paper on the effects of dropping schools by helicopter on rural villages in Afghanistan (written by Burde and Linden). Here is the paper for download (with my annotations): Burde Linden 2013

Computational

1. Empirical Exercise E13.1 (Stock and Watson book)

3.3.7 Week 7

Midterm exam

3.3.8 Week 8

Problem Solving

We will review the midterm exam. In particular: Q1, Q2 and Q5. (The other two questions are easy to answer if you have read the papers.)

Computational

1. Empirical Exercise E11.1 (Stock and Watson book)

Solution to part (f):

    twoway function y = _b[_cons] + _b[age] * x + _b[agesq]* x^2 + _b[colgrad], range(18 65)

3.3.9 Week 9

Problem Solving

Maximum likelihood estimation of probit and logit coefficients.

1. Define the maximum likelihood estimator.
2. Derive the maximum likelihood estimator.
3. Discuss statistical inference for the probit and logit coefficients.
4. Discuss consistency of the probit and logit estimators.

Note: In contrast to the linear probability model (which is a linear model that can be estimated straightforwardly by OLS) the probit and logit models are non-linear (remember that S-shaped curve from the lecture?). Non-linear models are considerably more difficult to estimate. In this problem solving session I will try to explain to you the principal idea and math of maximum likelihood estimation of probit and logit models. In the end, the estimation will need to be done by computers. Luckily, Stata offers a nice set of commands to help out.

Computational

1. Empirical Exercise E11.2 (Stock and Watson book)

Solution:

    // ====================================================
    // PREAMBLE
    // ====================================================
    clear all            // clear memory
    capture log close    // close any open log files
    set more off         // don't pause when screen fills

    // set work directory (put your own path here!):
    cd /path/to/location/on/your/computer/where/Stata/files/go
    log using E11_2.log, replace    // open new log-file

    // ====================================================
    // Work on your data set
    // ====================================================
    use "./Stock_data/Smoking.dta"
    summarize

    ******* a ***********
    generate agesq = age^2
    probit smoker smkban female age agesq hsdrop hsgrad colsome colgrad black hispanic, robust

    ******* b ***********
    * just read off from probit output in part a

    ******* c ***********
    test hsdrop hsgrad colsome colgrad

    ******* d ***********
    margins, at(smkban=(0 1) female=0 age=20 agesq=400 hsdrop=1 hsgrad=0 colsome=0 colgrad=0 black=0
    margins, dydx(smkban) at(female=0 age=20 agesq=400 hsdrop=1 hsgrad=0 colsome=0 colgrad=0 black=0

    ******* e ***********
    margins, at(smkban=(0 1) female=1 age=40 agesq=1600 hsdrop=0 hsgrad=0 colsome=0 colgrad=1 black=
    margins, dydx(smkban) at(female=1 age=40 agesq=1600 hsdrop=0 hsgrad=0 colsome=0 colgrad=1 black=

    ******* f ***********
    regress smoker smkban female age agesq hsdrop hsgrad colsome colgrad black hispanic, robust
    margins, at(smkban=(0 1) female=0 age=20 agesq=400 hsdrop=1 hsgrad=0 colsome=0 colgrad=0 black=0
    margins, dydx(smkban) at(female=0 age=20 agesq=400 hsdrop=1 hsgrad=0 colsome=0 colgrad=0 black=0
    margins, at(smkban=(0 1) female=1 age=40 agesq=1600 hsdrop=0 hsgrad=0 colsome=0 colgrad=1 black=
    margins, dydx(smkban) at(female=1 age=40 agesq=1600 hsdrop=0 hsgrad=0 colsome=0 colgrad=1 black=

    log close            // close log-file

3.3.10 Week 10

Problem Solving

We will briefly revisit last week’s problem solving session to summarize ML estimation of probit and logit models.

Computational

1. Revisit Empirical Exercise E11.2 (Stock and Watson book)
2. Empirical Exercise E10.1 (Stock and Watson book)
   (a) Regress lnvio on shall separately for the years 1977 and 1999. What is the causal effect?
   (b) Run a pooled regression across all years.
   (c) Can you think of an unobserved variable that varies by state but not across time? How about one that varies across time but not by state?
   (d) Reshape your data from long format to wide format.
   Use the reshaped data to create differenced variables (between 1999 and 1977) for lnvio and shall.
   (e) Run a regression of the differences. What is the causal effect? How does it compare to part (a)? Why should the estimate be different theoretically?

   For the rest of this exercise, reshape your data back into long format. (Simply reload the original data set.)

   (f) Run an $(n-1)$-binary regressors estimation of lnvio on shall.
   (g) Run a fixed effects estimation of lnvio on shall. Do it in two different ways:
       i. Hard way: demean the variables yourself and regress demeaned variables on each other.
       ii. Lazy way: use Stata’s inbuilt fixed effect estimation command.
       How do the results differ from part (f)?
   (h) Add the explanatory variables incarc_rate, density, avginc, pop, pb1064, pw1064, and pm1029 to the estimation.
   (i) Now also control for time fixed effects. Do it in three different (yet equivalent) ways:
       i. Entity demeaning with $(T-1)$-binary time indicators
       ii. Time demeaning with $(n-1)$-binary entity indicators
       iii. $(T-1)$-binary time indicators and $(n-1)$-binary entity indicators
   (j) Redo the main estimation using the logarithms of rob and mur instead of vio as outcome variables. How do your findings change?

3.3.11 Week 11

Problem Solving

Define and derive the fixed effect estimator.

Computational

Continue working on Empirical Exercise E10.1 (Stock and Watson book), see previous week.

3.3.12 Week 12

Problem Solving

No problem solving this week.

Computational

Empirical Exercise E15.1 (Stock and Watson book). (Note: you need to import the Excel spreadsheet that holds the data and save the imported data as a dta-file before you start working.)

Solution:

    // ====================================================
    // PREAMBLE
    // ====================================================
    clear all            // clear memory
    capture log close    // close any open log files
    set more off         // don't pause when screen fills

    // set work directory (put your own path here!):
    cd /path/to/location/on/your/computer/where/Stata/files/go
    log using E15_1.log, replace    // open new log-file

    // ====================================================
    // Work on your data set
    // ====================================================
    use "./Stock_data/USMacro_Monthly.dta"
    summarize

    *********** part a
    generate time = m(1947m1) + _n-1
    format time %tm
    tsset time
    generate L_IP = L1.IP
    generate ip_growth = 100 * log(IP/L_IP)
    summarize if tin(1952m1, 2009m12)

    *********** part b
    tsline Oil

    *********** part c and d
    generate D_Oil = D.Oil
    * non-cumulative effects
    newey ip_growth L(0/18).Oil if tin(1947m1, 2009m12), lag(7)
    testparm L(0/18).Oil
    * cumulative effects (see eq. 15.7 in Stock and Watson, 3rd ed)
    newey ip_growth L(0/17).D_Oil L(18/18).Oil if tin(1947m1, 2009m12), lag(7)

    ************ part e
    * do these plots in Excel, Stata is too awkward for plotting this stuff

    log close            // close log-file

3.4 Assignments

3.4.1 Assignment 1

Instructions

Answer all questions! This assignment is due at 2.00pm on Thursday, 28 August. It is worth 10% of your final mark for this course. Hand in your work by putting it in the EMET2008/6008 assignment box (HW Arndt Building 25a, opposite room 1002). Absolutely no extensions will be given; late assignments will receive zero credit. If you have a university-approved excuse for not handing in this assignment, then your marks for your final exam will be weighted up by 10% to compensate for the missed work.

While I would prefer it if you could provide typed answers, you may also hand in written answers as long as they are legible and easy to follow. (I will not only mark the correctness of your answers but also the clarity of the exposition and the transparency with which you communicate your results.)
The work that you hand in should consist of answers to the questions, together with an appendix which contains both the printout of a complete Stata do-file and a log-file (produced by the do-file) that covers the entire assignment. Answers should be in sentence form (i.e., single word or single number answers without explanation will be considered incomplete), but clarity of presentation is important, so try to make your comments/discussion brief and to the point. Annotated output does not constitute a sufficient answer to any question, but you should highlight those parts of your output that you explicitly use in your answers.

If you have any questions regarding these instructions, do not hesitate to ask me (during class meetings, during consultation times, or by e-mail).

Exercise 1

Frankel and Rose (for short: FR), in their 2005 paper "Is Trade Good or Bad for the Environment? Sorting Out the Causality", published in The Review of Economics and Statistics (volume 87(1), pages 85-91), empirically address the question: Is globalization good or bad for the environment? In particular, they examine whether countries which are more open to international trade incur more (or less) environmental damage as a result, controlling for international variations in real growth rates and in political institutions.

FR quantify environmental damage on seven dimensions: SO2 air concentrations, NO2 air concentrations, particulate matter air concentrations, CO2 air concentrations, deforestation, energy resources depletion, and rural clean water access. Their analytic focus is primarily on the SO2, NO2, and particulate matter air pollution impacts, however.

Overall, FR find that greater trade openness, quantified as

$$\frac{\text{Exports} + \text{Imports}}{\text{GDP}},$$

is actually associated with better environmental outcomes.
This result might seem surprising, in view of the:

"...race-to-the-bottom hypothesis, which says that open countries in general adopt looser standards of environmental regulation, out of fear of a loss in international competitiveness. Alternatively, poor open countries may act as pollution havens, adopting lax environmental standards to attract multinational corporations and export pollution-intensive goods. Less widely recognized is the possibility of an effect in the opposite direction, which we call the gains-from-trade hypothesis. If trade raises income, it allows countries to attain more of what they want, which includes environmental as well as more conventional output. Openness could have a positive effect on environmental quality (even for a given level of GDP per capita) for a number of reasons. First, trade can spur managerial and technological innovation, which can have positive effects on both the economy and the environment. Second, multinational corporations tend to bring clean state-of-the-art production techniques from high-standard source countries of origin to host countries. Third is the international ratcheting up of environmental standards through heightened public awareness. Whereas some environmental gains may tend to occur with any increase in income, whether taking place in an open economy or not, others may be more likely when associated with international trade and investment. Whether the race-to-the-bottom effect in practice dominates the gains-from-trade effect is an empirical question." (modified quotation from FR 2005)

In this exercise you will replicate FR’s empirical results for the SO2 measure of air pollution and diagnostically check their regression model, so as to assess the credibility of their statistical inference results.

Two features of this exercise should be noted at the outset.
First, it is worth noting that the FR model is estimated using only 41 sample observations. This is an unusually small sample size for a piece of research in applied econometrics that is published in such a high-quality journal. As we have learned in class, sample estimators will have an approximately normal distribution (justified by the central limit theorem); the larger the sample size, the better the approximation. For the purpose of this assignment, we will not worry further about the small sample size. We keep it in the back of our minds but are otherwise happy to use and apply our standard econometric toolkit.

Second, we conduct our analysis here under the assumption that all key explanatory variables are exogenous. In particular, real per capita income and trade openness are considered to be determined outside of the model. (We will deviate from that assumption in the next exercise.)

Use the Stata file Frankel_Rose.dta (available for download on Wattle). This file contains data on 41 countries, collecting (among others) the following variables:

    Variable   Description
    sulfdm     mean 1990 SO2 (sulfur dioxide) concentration (in micrograms per cubic meter)
    inc        logarithm of real per capita GDP (from the Penn World Tables 5.6; in 1990 dollars, PPP adjusted)
    incsqr     squared value of the logarithm of real per capita GDP
    pwtopen    100 · (Imports + Exports)/GDP from the Penn World Tables 5.6
    polity     index of democratic (+10) versus autocratic (-10) institutions
    lareapc    logarithm of land area per capita
    oecd       dummy variable which equals 1 if country is an OECD member country
    country    country name

1. Estimate a basic regression model with sulfdm as the outcome variable using the regressors inc, incsqr, pwtopen, polity, lareapc. Interpret your coefficient estimates. What does this say about the impact of trade openness on SO2 concentrations, i.e., about the relative importance of the "race-to-the-bottom" versus "gains-from-trade" hypotheses alluded to earlier?

2. Test the model for heteroskedasticity. Do you conclude that the model is heteroskedastic? (If so, proceed with the heteroskedasticity-corrected version of the model in everything that follows.) In the linear regression model, when you correct for heteroskedasticity, how do your coefficient estimates change (vis-a-vis the model in which you do not correct for heteroskedasticity)? What about the standard errors?

3. Explain (in words, not maths) what the $R^2$ of a regression measures. How does the adjusted $R^2$, denoted $\bar{R}^2$, differ from this? Using the adjusted $R^2$ statistic, what is the fraction of the sample variation in sulfur dioxide concentration which is explained by these five explanatory variables? By how much does this fraction decrease once the openness variable is dropped from the model? (Note: To have Stata report the value of adjusted $R^2$, use the command ereturn list after the regress command: adjusted $R^2$ will be listed as e(r2_a).)

4. Produce the scatter plot of sulfdm against the crucial independent variable pwtopen. Can you spot two outliers? Which countries do they correspond to? Are they the driving force behind your estimation results?

5. Estimate a re-specified model, using the logs of both sulfdm and pwtopen. (Recall from your study of the log-log model in EMET2007 that the coefficient on logpwtopen can be interpreted as the elasticity of sulfdm with respect to pwtopen.) How do your conclusions about the openness effect change?

6. Check whether the key coefficient in the model is different for OECD countries.

(Note: This exercise is from the book "Fundamentals of Applied Econometrics" by Richard Ashley.)

Exercise 2

In Exercise 1 you used OLS to study the relationship between trade openness and sulfur dioxide levels (as a proxy for environmental outcomes). That analysis was done under the assumption that all explanatory variables are exogenous.
The actual contribution of the paper by FR is to look deeper and examine the causal relationship between trade openness and environmental quality while both controlling for income and appropriately dealing with the likely endogeneity of both income and trade openness. To that end they use several instrumental variables to deal with these two explanatory variables. You will replicate some of these results in the current exercise.

Use the Stata file Frankel_Rose.dta. This file contains data on 41 countries, collecting (in addition to the variables mentioned in the previous exercise) the following instrumental variables:

    trade_potential (IV for pwtopen)
        Trade potential of a country. This variable combines information on a country’s geographical location (number of neighbor countries, access to sea, landlock status), population size, land area and language to construct a measure of potential trade. For example, all else equal, a country with access to the sea will have a higher trade potential than a country that is landlocked. This IV is notably correlated with the endogenous regressor pwtopen while plausibly uncorrelated with environmental outcomes.

    inc_exog (IV for inc)
        Exogenous income of a country. While per capita income inc is likely endogenous, it contains some exogenous components. FR combine information on a country’s lagged income as well as school attainment to construct the exogenous component of income. For example, all else equal, a country with higher average school attainment will have higher per capita income than a country with lower average school attainment. This IV is notably correlated with the endogenous regressor inc while plausibly uncorrelated with environmental outcomes.

    inc_exogsqr (IV for incsqr)
        Since their model specification also includes the square of the logarithm of real per capita GDP (incsqr), FR also define inc_exogsqr as the square of inc_exog and use this as an instrument for incsqr.

1. Re-estimate the basic model from Exercise 1, part 1, using instrumental variables estimation instead. Use all three instruments and make your estimation robust to heteroskedasticity. What does this say about the impact of trade openness on SO2 concentrations, i.e., about the relative importance of the "race-to-the-bottom" versus "gains-from-trade" hypotheses alluded to earlier?

2. Examining the first-stage regressions, do all three first-stage models have reasonably high adjusted $R^2$ values? Do you need to be concerned about weak instruments? (Use the "Rule of Thumb" explained in the textbook, section 12.3.)

3. Using the insights gained from Exercise 1, re-estimate the model, replacing sulfdm and pwtopen by their logarithms, logsulfdm and logpwtopen. How do your conclusions about the openness effect change?

4. Test whether the OLS and 2SLS coefficient estimates are significantly different. (Hint: use the Hausman test; in Stata type help hausman to learn how to use it. Provide a brief explanation of what the Hausman test does.)

5. In conclusion to Exercises 1 and 2, what is your answer to the question "Is globalization good or bad for the environment?" What are the strengths and weaknesses of the econometric analysis conducted here? Do you see any possible extensions that could help improve your research?

(Note: This exercise is from the book "Fundamentals of Applied Econometrics" by Richard Ashley.)

Exercise 3

In the research paper "Does Size Matter in Australia", published in The Economic Record (Vol. 86, No. 272, March 2010, pp. 71-83), Michael Kortt and Andrew Leigh address the research question: Do taller and slimmer workers earn more? To that effect, they consider the following linear model:

$$W_i = \beta_0 + \beta_1 \text{Height}_i + \beta_2 \text{BMI}_i + \beta_3 X_{i3} + \cdots + \beta_k X_{ik} + u_i.$$

(This equation is my version of equation (1) on page 73 of their paper.)
Here, $W_i$ is the log hourly wage of person $i$, $\text{Height}_i$ represents a person’s height and $\text{BMI}_i$ stands for a person’s body mass index. The remaining regressors, $X_{i3}, \ldots, X_{ik}$, capture a person’s demographic characteristics, including gender, age (linear and quadratic) and education.

Obtain a copy of the paper (available online for ANU students and faculty) and answer the following questions.

1. Kortt and Leigh begin the analysis by estimating all coefficients by OLS. Summarize their OLS results regarding the two main coefficients of interest, $\beta_1$ and $\beta_2$ (for height and BMI).

2. Would you interpret these estimates as causal? What are the main endogeneity problems in this regression?

3. Explain how Kortt and Leigh attempt to address the endogeneity problem using instrumental variables. How do their findings change?

4. What is the main conclusion of the paper? Do taller and slimmer workers in Australia earn more? What is the evidence from other countries?

3.4.2 Assignment 2

Instructions

Answer all questions!

This assignment is due at 2.00pm on Wednesday, 29 October. It is worth 10% of your final mark for this course. Hand in your work by putting it in the EMET2008/6008 assignment box (HW Arndt Building 25a, opposite room 1002).

Absolutely no extensions will be given; late assignments will receive zero credit. If you have a university-approved excuse for not handing in this assignment, then your marks for your final exam will be weighted up by 10% to compensate for the missed work.

While I would prefer it if you could provide typed answers, you may also hand in written answers as long as they are legible and easy to follow. (I will not only mark the correctness of your answers but also the clarity of the exposition and the transparency with which you communicate your results.)
The work that you hand in should consist of answers to the questions, together with an appendix which contains both the printout of a complete Stata do-file and a log-file (produced by the do-file) that covers the entire assignment. Answers should be in sentence form (i.e., single word or single number answers without explanation will be considered incomplete), but clarity of presentation is important, so try to make your comments/discussion brief and to the point. Annotated output does not constitute a sufficient answer to any question, but you should highlight those parts of your output that you explicitly use in your answers.

If you have any questions regarding these instructions, do not hesitate to ask me (during class meetings, during consultation times, or by e-mail).

Exercise 1

The data set PNTSPRD (available on Wattle) contains information from the Las Vegas sport betting market. The overarching research question is whether the favorite team is more likely to win the game. Consider the linear probability model

$$\Pr(\text{favwin} = 1 \mid \text{spread}) = \beta_0 + \beta_1 \, \text{spread},$$

where spread is a proxy for the favorite team. A high point spread means that a team is the favorite.

Here is a quick primer on point spread betting from Wikipedia:

The general purpose of spread betting is to create an active market for both sides of a binary wager. If the wager is simply "Will the favorite win?", more bets are likely to be made for the favorite, possibly to such an extent that there would be very few betters willing to take the underdog. The point spread is essentially a handicap towards the underdog. The wager becomes "Will the favorite win by more than the point spread?" The point spread can be moved to any level to create an equal number of participants on each side of the wager. This allows a bookmaker to act as a market maker by accepting wagers on both sides of the spread.
The bookmaker charges a commission, or vigorish, and acts as the counterparty for each participant. As long as the total amount wagered on each side is roughly equal, the bookmaker is unconcerned with the actual outcome; profits instead come from the commissions. (excerpt taken on October 7, 2014)

1. Explain why, if the spread incorporates all relevant information, we expect $\beta_0 = 0.5$.

2. Estimate the linear probability model. Test the hypothesis $\beta_0 = 0.5$ against a two-sided alternative. (Make all estimations robust to heteroskedasticity throughout this entire exercise.)

3. Is spread statistically significant? What is the estimated probability that the favored team wins when spread = 10?

4. Now estimate the model by probit. Interpret and test the hypothesis that the intercept is equal to 0.5.

5. Use the probit model to estimate the probability that the favored team wins when spread = 10. Compare this with the linear probability model.

6. Add the variables favhome, fav25, and und25 to the probit model and test joint significance of these variables.

7. Redo parts 4, 5, and 6 using the logit model.

8. Which sport is this exercise about?

(Note: This exercise is from the book "Introductory Econometrics: A Modern Approach" by Jeffrey Wooldridge.)

Exercise 2

Krueger and Maleckova, in their paper "Education, Poverty and Terrorism: Is There a Causal Connection?", published in the Journal of Economic Perspectives (2003), attempt to estimate the causal effect of education and poverty on terrorism.

1. What is the main research question of the paper?

2. What econometric method do they use to estimate causal effects?

3. What is the main outcome variable?

4. What are the main explanatory variables?

5. What other explanatory variables do they include?

6. What is their main finding?

7. What problems/shortcomings do you see in their research?
Exercise 3

The data set Airfare (available on Wattle) contains information on airfares, passenger volume, flight distance and market concentration for 1,149 flight routes (connections) for the years 1997 through 2000. The overarching research question is whether increased competition reduces airfares. (Do connections with less market concentration have cheaper prices?)

The data set contains the following variables:

    Variable   Description
    year       year: 1997, 1998, 1999, 2000
    id         route identifier (the units of analysis are flight routes)
    dist       distance of flight route (in miles)
    passen     average number of passengers per day
    fare       average one-way airfare, $
    bmktshr    market share, biggest carrier (proxy variable for market concentration)

The main explanatory variable is bmktshr. A higher value of bmktshr implies higher market concentration on that route and therefore less competition.

Consider the following linear model:

$$\log(\text{fare})_{it} = \eta_t + \beta_1 \, \text{bmktshr}_{it} + \beta_2 \log(\text{dist})_{it} + \beta_3 [\log(\text{dist})_{it}]^2 + \alpha_i + u_{it},$$

where $\eta_t$ means that we allow for different year intercepts.

1. Estimate the above linear model separately for all four years. If $\Delta \text{bmktshr} = 0.1$, what is the estimated percentage increase in fare? (Make all estimations robust to heteroskedasticity throughout this entire exercise.)

2. Run pooled OLS across all years, i.e., treat the data as one big cross section and control for years by including year dummies. What is your estimate of $\beta_1$? Is it significant?

3. For what value of dist does the relationship between log(fare) and dist become positive?

4. Estimate the linear model using fixed effects. What is the fixed effects estimate of $\beta_1$?

5. Add the logarithm of passen to the model. How do your estimates change? In summary, does higher concentration (i.e., higher bmktshr) on a route increase airfares? What is your best estimate?

6. Name two characteristics of a route (other than distance) that are captured by $\alpha_i$ and that are correlated with bmktshr.
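As a starting point for the estimation parts of this exercise, the pooled OLS and fixed effects regressions could be set up roughly as follows. This is only a sketch under stated assumptions: it assumes the data set is stored as Airfare.dta in your working directory, and the generated variable names lfare, ldist and ldistsq are placeholders chosen for illustration (they are not part of the data set).

```stata
// Sketch only: file name and generated variable names are assumptions
use "Airfare.dta", clear

generate lfare   = log(fare)     // log of the one-way airfare
generate ldist   = log(dist)     // log of the route distance
generate ldistsq = ldist^2       // squared log distance

// pooled OLS with year dummies and heteroskedasticity-robust standard errors
regress lfare bmktshr ldist ldistsq i.year, vce(robust)

// fixed effects: declare the panel structure (route id, year) first
xtset id year
xtreg lfare bmktshr ldist ldistsq i.year, fe vce(robust)
```

If dist does not change over time within a route (as one would expect), the fixed effects regression cannot identify the coefficients on ldist and ldistsq, and Stata will report them as omitted. That is a useful thing to think about when interpreting $\alpha_i$.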
(Note: This exercise is from the book "Introductory Econometrics: A Modern Approach" by Jeffrey Wooldridge.)

3.5 Illustration of Central Limit Theorem using Monte Carlo Simulation

The principal problem in econometrics is that we want to learn something about the unknown population distribution. For example, we want to know the mean height of Australians. In practice, we can never know the true population mean; instead we make statistical inferences about the population mean based on one random sample of size n that is drawn from the population.

We have learned that a good estimator of the population mean is the sample average $\bar{Y}_n$. We have also learned that the sample average itself is a random variable. If you draw more than one random sample from the population you are likely to obtain different estimates of the population mean when computing the sample average. The central limit theorem (CLT) helps us understand what the approximate distribution of the sample average looks like.

To illustrate the CLT we use Monte Carlo (MC) simulation. Here is a brief excerpt from Wikipedia explaining the term:

"Monte Carlo [Simulations] are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results; typically one runs simulations many times over in order to obtain the distribution of an unknown probabilistic entity. The name comes from the resemblance of the technique to the act of playing and recording results in a real gambling casino. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to obtain a closed-form expression, or infeasible to apply a deterministic algorithm. Monte Carlo methods are mainly used in three distinct problem classes: optimization, numerical integration and generation of draws from a probability distribution."
(excerpt taken on 30 July 2014)

Monte Carlo simulations are run on computers that are able to quickly calculate thousands (or millions) of sample averages for as many different samples. In an MC simulation we pretend to know what the distribution of $Y_i$ in the population is: we generate an artificial population from which we draw a large number of different random samples, and we then compute the sample average for each of these random samples. We are then able to visualize the distribution of $\bar{Y}_n$ by simply looking at a histogram of the different sample averages.

To be specific, let’s assume that the population values $Y_i$ are actually exponentially distributed with $\lambda = 1$. (Using the exponential distribution is only an example. We could choose any other statistical distribution with finite variance here; the CLT would still apply.) If you (vaguely) recall the properties of the exponential distribution, this implies that the population mean $\mu$ is equal to 1 and the population variance $\sigma^2$ is also equal to 1. If we draw one random sample of size n, the CLT would therefore suggest the following approximate distribution:

$$\bar{Y}_n \sim N(1, 1/n)$$

In an MC simulation we are in the luxurious position to create an artificial population based on the exponential distribution of, say, 1,000,000 members. We then draw 10,000 random samples of size n (which can take on the values 1, 5, 10, 30, 100 in the pictures below) from that population and plot the histogram of the 10,000 sample averages. As you can see in the plots below, as the sample size increases from 1 to 100, the distribution resembles more and more that of a normal distribution.

Next, instead of studying the approximate distribution of $\bar{Y}_n$, we standardize the distribution and thus study

$$\frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}} = \frac{\bar{Y}_n - 1}{1/\sqrt{n}} \sim N(0, 1)$$

It is then easier to superimpose the pdf of the standard normal distribution, which can then be directly compared to the histograms.
The CLT says that the histograms should get closer and closer to the pdf of the standard normal distribution (the dashed line) as the sample size grows from 1 to 5 to 10 to 30 to 100.

This little MC simulation illustrates the CLT and it also shows us that sample sizes do not necessarily need to be very large for the sample average to have an approximately normal distribution. In practice, a sample size of 30 seems sufficiently large for that purpose.

I hope you are convinced now that the CLT really "works". The question remains: how do we use the CLT for practical purposes?
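The simulation described in this section could be reproduced in Stata along the following lines. This is only a sketch, not the code that produced the plots: the program name expmean, the seed and the variable names are hypothetical choices, and, unlike the description above, the code draws each sample directly from the exponential distribution instead of first building an artificial population of 1,000,000 members.

```stata
clear all
set seed 2014                       // arbitrary seed, only for reproducibility

// draw one sample of size n from the exponential distribution (lambda = 1)
// and return its sample average
program define expmean, rclass
    args n
    drop _all
    set obs `n'
    generate y = -ln(runiform())    // inverse-cdf method: exponential with mean 1
    summarize y, meanonly
    return scalar ybar = r(mean)
end

local n = 30                        // try 1, 5, 10, 30, 100
simulate ybar = r(ybar), reps(10000) nodots: expmean `n'

// standardize using mu = 1 and sigma = 1
generate z = (ybar - 1) / (1 / sqrt(`n'))
summarize ybar z

// histogram of the standardized sample averages; the normal option
// overlays a normal density with the same mean and standard deviation as z
histogram z, normal
```

Run the code as one do-file so that the local macro n is still defined when z is generated. With n = 30, the mean of ybar should be close to 1 and its standard deviation close to $1/\sqrt{30} \approx 0.18$; changing n lets you see how the histogram moves towards the bell shape as the sample size grows.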