Sacha Kapoor - Masters Metrics

Address: Max Gluskin House, 150 St. George, Rm 329
Email: sacha.kapoor@utoronto.ca
Web: http://individual.utoronto.ca/sacha_kapoor

1 Basics

Where do you get data?

Erasmus has a data service center (EDSC) that gives students access to various data sets. Its website is:

• EDSC: http://www.eur.nl/ub/en/edsc/

The EDSC is helpful both for downloading data and for advice on the data you need. You can contact the EDSC data team directly for help.

There are many different types of data:

• Financial markets data:
  – CRSP Database - NYSE/AMEX/Nasdaq daily and monthly security prices and other historical data for over 20,000 companies
  – Canadian Financial Markets Research Centre - Toronto Stock Exchange trading information about specific securities
  – Fundata Mutual Fund Database
• Company financial data:
  – Financial Post Corporate Database
  – COMPUSTAT Database - Income Statement, Balance Sheet, Flow of Funds, and supplemental data items on more than 10,000 active and 9,400 inactive companies
• National income statistics:
  – OECD National Accounts Database
  – World Bank databases
  – Penn World Tables

It is now common for economics journals to post the data used in the articles they publish. You should make use of these websites; they are a great source of data for your thesis. For example, the American Economic Association publishes several journals with empirical articles. Visit https://www.aeaweb.org/, click on Journals, then on American Economic Review, American Economic Journal: Economic Policy, or American Economic Journal: Applied Economics. Look through the papers in these journals. Many have a data folder attached, and in it you can find a Readme file that describes the availability of the data. If the data are unavailable, the Readme file will say so; otherwise you can assume they are available.

What is Stata?
• A high-level, general-purpose statistical software package (built on a C environment) with many built-in functions.
  – Caveat: Functions are not substitutes for understanding.
• Three ways to use Stata:
  – Interactively, through the command prompt (entering the commands one by one).
  – Batch files, by collecting commands and running them all at once.
  – Point and click.

How to collect commands? Use a do-file. To open the do-file editor:

    doedit

To track results and output, use a log file:

    cd ../../../../Documents/TA/2010-2011/Masters_Metrics
    log using "tutorial_091610.log", replace

The first command changes the working directory to the data location and the second opens the log file.

To list the contents of the current working directory:

    dir

To import comma-delimited data (.csv), use the insheet command (in Stata 13 and later, import delimited supersedes insheet):

    insheet using "S&P_data.csv"

To examine attributes of the data:

    des

Another way to obtain the same information and more:

    edit

Note that in Stata 11, unlike previous versions, you can run commands while the editor is open.

Before proceeding, label the data and variables:

    label data "S&P (01-31-80 to 12-31-99)"
    describe
    label variable eps "Earnings per share"
    label variable price "Price per share"
    label variable weather "Weather"
    describe

To save the data in Stata format:

    save "sacha_S&P.dta", replace

To import data already in Stata format, use the 'use' command:

    clear
    use "sacha_S&P.dta"

To destring the date variable, let's try:

    destring date, replace
    list date in 1/10
    destring date, force replace
    list date in 1/10

Two issues arise: 1. the forced conversion turns the dates into missing data; 2. destring is not the proper command for converting dates.
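Both issues can be reproduced on a small made-up dataset. The sketch below is only an illustration. The three date strings and the variable names date_num and date_forced are hypothetical and not part of the S&P file. Note that clear discards whatever data are in memory.

```stata
* Hypothetical toy data: three date strings in month/day/year form
clear
input str10 date
"01/31/1980"
"02/29/1980"
"03/31/1980"
end

* Without force, destring refuses to convert because "/" is nonnumeric,
* so date_num is not created
destring date, generate(date_num)

* With force, every nonnumeric string is converted to a missing value
destring date, generate(date_forced) force

* Count the missing values produced by the forced conversion
count if missing(date_forced)
list
```

Using generate() rather than replace keeps the original string variable intact, so the forced conversion can be inspected without losing the dates.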
To deal with the first problem, take the necessary precautions in your preamble. The preserve command takes a snapshot of the data that restore later brings back, so the forced conversion is undone:

    clear
    use "sacha_S&P.dta"
    preserve
    destring date, force replace
    list date in 1/10
    edit
    restore
    list date in 1/10

To deal with the second problem, use the date() function:

    generate date2 = date(date,"MDY")
    list date2 in 1/10

Now let's tell Stata that this is a time series. Because date2 counts days, first convert it to a monthly date:

    generate mdate = mofd(date2)
    format mdate %tm
    tsset mdate

To extract more detailed date information:

    generate year = year(date2)
    generate month = month(date2)
    generate day = day(date2)
    label variable year "Year"
    label variable month "Month"
    label variable day "Day"
    describe
    list in 1/10

To drop variables:

    preserve
    drop day

To keep variables:

    keep year

To drop observations 5 through 15:

    drop in 5/15

Let's restore the data:

    restore

Still on the topic of time series data, to generate a trend:

    generate x = _n
    list x in 1/10

To generate lags (for x):

    generate x_1 = x[_n-1]
    replace x_1=0 if x_1==.

Let's take a closer look at the weather variable:

    des weather
    edit weather

One way to turn this into a dummy variable:

    generate weather2 = 0
    replace weather2=1 if weather=="yes"
    replace weather2=0 if weather=="no"
    list weather2 in 1/10

Notice how the replace command conditions on a logical expression. For future reference, conditional statements can involve any of the following:

• <, less than
• >, greater than
• <=, less than or equal to
• >=, greater than or equal to
• ==, equal to in a logical expression
• ~= (or !=), not equal to in a logical expression

2 Some Basic (Mostly) Statistical Commands

To read about memory allocation and check the current settings:

    help memory
    query memory

To set a new allocation of 100 megabytes (needed only in older versions; Stata 12 and later manage memory automatically):

    set memory 100m

Note that the set command can be used to change many basic defaults in Stata.

I always begin investigations with the following command:

    tabulate weather

Why is it nonsensical to tabulate price?
    tabulate price

To present continuous data:

    histogram price

An even better way:

    histogram price, kdensity

Compare this with:

    histogram eps, kdensity

Coarser evidence is obtained with the following command:

    summarize price eps

To include a summary of a categorical variable, we can use the xi prefix:

    xi: summarize price eps i.weather

To calculate means for price and eps under good and bad weather:

    by weather, sort: summarize price eps

To summarize a subset of values:

    summarize price if price <= 150

To collapse the data and create a new dataset:

    preserve
    collapse (mean) price, by(weather)
    describe
    save "price.dta", replace
    restore
    des

To test the hypothesis that the mean of price equals 150, at the 95 percent confidence level:

    ttest price == 150, level(95)

To test the equality of means:

    gen price_g = price if weather2==1
    gen price_b = price if weather2==0
    ttest price_g == price_b, unpaired unequal

3 Regression

Suppose our interest is in the relationship between price and eps:

    twoway (scatter price eps)
    twoway (scatter price eps) (lfit price eps)

Fitting a line through these points is equivalent to:

    regress price eps

Controls are easy to add:

    regress price eps x

The xi prefix works here as well:

    xi: regress price eps x i.weather

One way to deal with persistence in the dependent variable:

    generate price_1 = price[_n-1]
    xi: regress price eps x i.weather price_1

4 Merging Data Sets

Let's access online data from Stata.com:

    webuse odd
    list
    webuse even1, clear
    list

Merges can be one-to-one, pairing observations by their position in each dataset:

    merge using http://www.stata-press.com/data/r10/odd
    list

or can match observations across datasets on a key variable:

    webuse even1, clear
    merge number using http://www.stata-press.com/data/r10/odd, sort
    list

In Stata 11 and later, the preferred syntax for these two merges is merge 1:1 _n using ... and merge 1:1 number using ..., respectively.

5 Loops

Let's generate data:

    clear
    set obs 100

To create a variable with draws from a uniform distribution:

    generate y = runiform()
    list y in 1/10

To generate many variables with draws from the uniform distribution:

    forvalues i = 1(1)100 {
        generate x`i' = runiform()
    }

Note: (1) gives the increment. The loop generates 100 variables, x1 through x100, each holding draws from the uniform distribution over (0,1).

To check for consistency of an estimator:

    webuse census2, clear
    generate x = rnormal(1000,100)
    generate e = rnormal()
    list x e in 1/10
    generate y = 100+1*x + e
    regress y x

Since y is generated as 100 + 1*x + e, the estimates should be close to the true intercept of 100 and slope of 1.

6 Panel Data

    clear
    use "MATHPNL.DTA"
    des

Tell Stata you have a panel:

    xtset distid year

To run regressions using panel data:

    xtreg math4 y93 y94 y95 y96 y97 y98 lrexpp lenrol lunch, fe
    xtreg math4 y93 y94 y95 y96 y97 y98 lrexpp lenrol lunch, fe robust

To obtain predictions for the dependent variable and residuals, respectively:

    predict yhat
    predict resid, e

Without the e option, predict would return the linear prediction a second time.

To compare predictions with actual values:

    edit yhat math4

To close the log file:

    log close
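
As a supplement to the consistency check in the Loops section, a single regression on one sample says little about consistency. A minimal sketch of a fuller check is to repeat the simulation at increasing sample sizes and watch the slope estimate settle near its true value of 1. The sample sizes and the seed below are illustrative assumptions, not values from these notes.

```stata
* Sketch: rerun the simulation y = 100 + 1*x + e at growing sample sizes.
* The seed and the list of sample sizes are arbitrary choices.
set seed 12345
foreach n of numlist 50 500 5000 50000 {
    clear
    quietly set obs `n'
    generate x = rnormal(1000, 100)
    generate e = rnormal()
    generate y = 100 + 1*x + e
    quietly regress y x
    * Report the slope estimate and its standard error for this sample size
    display "n = `n'   slope = " %9.6f _b[x] "   se = " %9.6f _se[x]
}
```

If the estimator is consistent, the reported standard error should shrink as n grows and the slope should stay close to 1. Run this before log close if the output should appear in the log.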