GETTING STARTED WITH STATA
Sébastien Fontenay
ECON - IRES

THE SOFTWARE
Software developed by StataCorp, first released in 1985
Functionalities
› Data management
› Statistical analysis
› Graphics
Using Stata at UCL
› Computer labs
  • Socrate 30, 31-32, 33, 34, 54 and 68
  • Dupriez 143
  • Leclercq 74, 76, 77 and 78
› Student licence to install on your personal computer, valid for the whole duration of your studies, at the price of 20 euros: www.uclouvain.be/438229.html

FINDING SUPPORT (1)
Best documentation
› help command
› search keyword
Stata website: www.stata.com/support
› Frequently Asked Questions
› Video tutorials
› Statalist
Books
› Cahuzac, E., Bontemps, C. (2008). Stata par la pratique: Statistiques, graphiques et éléments de programmation.
› Cameron, A.C., Trivedi, P.K. (2009). Microeconometrics using Stata.
› Becketti, S. (2013). Time series using Stata.
UCLA: www.ats.ucla.edu/stat/stata

FINDING SUPPORT (2)
For all your questions related to data management or analysis using Stata
› Website: http://www.uclouvain.be/411370
› Email: sebastien.fontenay@uclouvain.be
› By appointment only:
  • Bâtiment Dupriez (office d010), 3 place Montesquieu

COURSE TOPICS
Quick tour of Stata
• Working environment
• Writing commands
Data management
• Inputting data
• Transforming data
Data analysis
• Descriptive statistics
• Linear regression
• Exporting results

SECTION 1 QUICK TOUR OF STATA
• Working environment
• Writing commands

WORKING ENVIRONMENT
The working environment is composed of 5 windows
› Results: output of commands
› Variables: list of variables and their labels
› Review: history of commands
› Properties: of variables and dataset
› Command: window where commands are typed

Three specific windows can be opened from their toolbar icons
› Data editor/browser
  • Display data in memory
› Viewer
  • Display log and help files
› Do-file Editor
  • Text editor to save/execute commands
There are 3 main types of files used in Stata
› .dta: data
› .do: commands (do-file)
› .smcl | .log: output (log file)

All software functionalities are
available from the dropdown menus (Data, Graphics, Statistics)
› Useful when you are unsure which command to run or unfamiliar with the available options
Every command issued in this manner is echoed to the Review and Results windows
› e.g. sysuse auto.dta

In order to use Stata effectively, you should always follow this three-step process:
› Open a do-file
› Choose your working directory
  • cd "C:\Users\Me"
  • mkdir stata_training
  • cd stata_training
    - You can see the current working directory at the bottom left of the main window
› Start a log file (saving commands and their output)
  • log using filename [, text append replace]
    - log close
    - log off | on

WRITING COMMANDS
Stata commands use a common syntax:
  [prefix :] command [varlist] [= exp] [if] [in] [, options]
  • The square brackets denote optional elements
  • Placeholder words (italicized on the slides) are to be substituted by the user
› varlist denotes a list of variables
› exp is a mathematical expression
Stata is case sensitive! (i.e. UPPERCASE != lowercase)

Operators may be used to manipulate numerical or string variables
  Arithmetic               Logical      Relational
  +   addition             &   and      >    greater than
  -   subtraction          |   or       <    less than
  *   multiplication       !   not      >=   greater than or equal
  /   division             ~   not      <=   less than or equal
  ^   raised to power                   ==   equal
                                        ~=   not equal
                                        !=   not equal
Note that a double equal sign (==) is used for equality testing; a single equal sign (=) is used for assignment

Logical and relational operators are particularly useful with the if qualifier to define the sample for analysis
The if qualifier at the end of a command restricts the command to the observations for which the expression is true
› command if exp
  • list make if foreign==1
  • list if make=="Volvo 260"
  • list make price if price>=5000 & price<=7000
  • list make price if price<5000 | price>7000
Note that character strings are enclosed in double quotes

You can refer to a range of numbers using the following shorthand (f stands for first, the letter l for last)
  1/30    1 to 30
  1/l     1 to the last number
  f/-5    first to the 5th number before the end
  -5/l    last five numbers
This shorthand is particularly useful with the in qualifier to specify a range of observations to be used
› command in range
  • list in f/10
  • list in -10/l
  • list make price in 74

The by prefix repeats execution of a command on subsets of the data
› subsets are groups of observations that take the same value of a given variable (often a categorical variable)
  • by varname: command
    - by foreign: list make
› If the dataset is not sorted by that variable, you should use the bysort prefix instead
  • bysort varname: command

SECTION 2 DATA MANAGEMENT
• Inputting data
• Transforming data

INPUTTING DATA
To open a dataset in Stata format (.dta): use
› use filename [, clear]
  • sysuse
    - opens example datasets installed with Stata
To save a dataset in Stata format: save
› save filename [, replace]
Stata can also import/export Excel files (.xls or .xlsx)
› import excel filename [, firstrow]
› export excel filename [, firstrow(variables)]
By default, Stata opens/saves a dataset from/in the current working directory but you can specify
› another directory: use | save "C:\Users\Me\Stata_training\dataset.dta"
› a web address: use
http://sites.uclouvain.be/datasupport/data/wage.dta

Summary of the dataset
› describe: information on the dataset in memory
› codebook: detailed description of variables
Further explore data in memory
› count: number of observations
› list: display data in the results window
Manipulate variables/observations
› keep wage educ exper
  • Keep only the variables wage, educ and exper
› drop in 1/10
  • Drop the first 10 observations
› sort wage
  • Sort observations by ascending values of wage

TRANSFORMING DATA
To create a new variable: generate
› generate newvar = exp [if] [in]
  • exp may be a number, a character string or a mathematical function
  • generate constant = 1
    - Create a constant equal to 1
  • generate constant_text = "text"
    - Create a constant that contains the character string "text"
  • generate logwage = ln(wage)
    - Create a variable equal to the natural logarithm of wage
  • generate expersq = exper^2
    - Create a variable equal to the square of exper

To create specific variables using time series operators
› generate lag_gdp = L.gdp
  • Create a variable corresponding to the first lag of gdp
› generate lead_gdp = F.gdp
  • Create a variable corresponding to the first lead of gdp
› generate diff_gdp = D.gdp
  • Create a variable corresponding to the first difference of gdp
Before using these operators, you must tell Stata that you are working with time series data using the command: tsset
› tsset time [, yearly monthly quarterly daily]
Using system variables (_n is the number of the current observation, so the data must be sorted by time)
› generate gdp_growth = ((gdp[_n] - gdp[_n-1]) / gdp[_n-1])*100
  • Create a variable equal to the growth rate of gdp, in percent

To modify an existing variable: replace
› replace wage=20 if wage>=20
  • Top-codes wage at 20; add & !missing(wage) to the condition to leave missing values untouched
To rename an existing variable: rename
› rename wage hourly_wage
You can also add a brief description to the variable using labels
› label variable educ "total years of education"

When transforming data, one must be careful with missing values
› Missing values in Stata are coded with a .
(period)
Stata treats missing values as very large numbers, greater than any other value of a given variable
› In certain cases you should use the if qualifier to exclude missing values; otherwise a condition such as wage>15 is also true when wage is missing
  • generate rich = (wage>15) if wage<.
  or
  • generate rich = (wage>15) if wage!=.
  or
  • generate rich = (wage>15) if !missing(wage)

SECTION 3 DATA ANALYSIS
• Descriptive statistics
• Linear regression
• Exporting results

DESCRIPTIVE STATISTICS
Categorical variables
› One-way table of frequencies
  • tabulate female
    - The option [, missing] also displays the frequency of missing observations
› Two-way table of frequencies
  • tabulate female married
Continuous variables
› summarize gives the number of observations, the mean, the standard deviation, the minimum and maximum values
  • summarize wage educ
    - The option [, detail] displays the main percentiles, the four smallest and four largest values, the variance, as well as the skewness and kurtosis measures
Pearson's correlation coefficient
› correlate varlist [, covariance]
  • The option [, covariance] displays covariances instead of correlations

Exploring data with graphs
› Distribution of a continuous variable: histogram
  • histogram wage
    - The option [, normal] overlays a normal density curve on the plot
› Scatter plot between two variables: scatter
  • scatter wage educ
› Evolution of a time series: tsline
  • tsline gdp
    - available only after tsset

LINEAR REGRESSION
We seek to estimate the relationship between one dependent variable and a set of independent variables
› using the Ordinary Least Squares (OLS) estimator
Classical linear model assumptions (Wooldridge, 2008):
› Model is linear in parameters
› Data are a random sample of the population
› No perfect collinearity between independent variables
› Zero conditional mean of the error term
› Homoskedasticity
› Normality of the error term

The model we want to estimate:
› $\log(\text{wage}) = \beta_0 + \beta_1\,\text{education} + \beta_2\,\text{experience} + \beta_3\,\text{tenure} + u$
  • where:
    - wage is
average hourly earnings in dollars
    - education is the number of years of education
    - experience is the number of years of labour market experience
    - tenure is the number of years with the current employer
In Stata:
› regress logwage educ exper tenure

[Figure: Stata output of the regress command]

Analysis of variance
› Sum of Squares (SS)
  • Explained variance (model)
  • Residual variance
  • Total variance
› Degrees of freedom (df)
› Mean Squares (MS)
  • SS divided by df

Overall model fit
› Number of observations
› F-statistic
› p-value associated with the F-statistic
  • testing the null hypothesis that all of the model coefficients (except the intercept) are 0
› R-squared
  • proportion of variance in the dependent variable explained by the independent variables
    - SS(model) divided by SS(total)
› Adjusted R-squared
› Standard deviation of the error term (Root MSE)
  • $\sqrt{MS(\text{residual})}$

Parameter estimates
› Dependent variable (1)
› Independent variables and intercept (2)
› Coefficients (3)
› Standard errors (4)
› t-statistics (5)
› p-values associated with the t-statistics (6)
  • testing the null hypothesis that a given coefficient is 0
› 95% confidence intervals (7)
[Figure: coefficient table of the regress output, with its elements annotated (1) to (7)]

Predicting fitted values and residuals
› predict wage_fitted
  • e.g. 1.304921 = 0.2843595 + 11*0.092029 + 2*0.0041211 + 0*0.0220672
› predict wage_resid, r
  • r is the abbreviation of the residuals option (observed value minus fitted value)
  • e.g.
-0.1735185 = 1.131402 - 1.304921

       logwage    educ  exper  tenure  wage_fitted  wage_resid
   1   1.131402    11     2      0     1.304921     -0.1735185
   2   1.175573    12    22      2     1.523506     -0.3479329
   3   1.098612    11     2      0     1.304921     -0.2063083
   4   1.791759     8    44     28     1.819802     -0.0280429
   5   1.667707    12     7      2     1.461690      0.2060172
   6   2.169054    16     9      8     1.970451      0.1986027
   7   2.420368    18    15      7     2.157168      0.2631997
   8   1.609438    12     5      3     1.475515      0.1339233
   9   1.280934    12    26      4     1.584125     -0.3031912
  10   2.900322    17    22     21     2.402928      0.4973939

Incorporating categorical information into regression models
Dummy variables (coded as 0/1) can be included as such in the regression
› regress wage educ exper tenure female
Categorical variables with more than two categories must be included using the i. prefix
› regress wage educ exper tenure i.region
  • Stata automatically creates a dummy variable for each category and includes them in the regression, except for the reference category (by default, the lowest value)
    - You can use the prefix ib#. instead to change the reference category (e.g. ib3.region makes category 3 the reference)

Post-estimation tests
› Multicollinearity (Wooldridge, 2008 - chapter 3, p99)
  • estat vif
    - Rule of thumb: a variance inflation factor above 10 signals a multicollinearity problem
› Normality of the residuals
  • sktest varname
    - skewness and kurtosis test, testing the null hypothesis that the variable is normally distributed
  • swilk | sfrancia varname
    - Shapiro-Wilk and Shapiro-Francia tests
  • varname is the residual variable created with predict
› Homoskedasticity (Wooldridge, 2008 - chapter 8)
  • estat hettest
    - Breusch-Pagan test, testing the null hypothesis of homoskedasticity
  • estat imtest, white
    - White test, testing the null hypothesis of homoskedasticity
  • The [, robust] option of regress gives heteroskedasticity-robust standard errors
› F-test: testing that a group of variables has no effect on the dependent variable, i.e. a joint hypotheses test (Wooldridge, 2008 - chapter 4, p143)
  • test var1 var2

EXPORTING RESULTS
outreg2 makes it easy to export the results of
one or several regressions
› to Microsoft Office applications: Word, Excel
› to LaTeX
outreg2 is a user-written command that must be installed once: ssc install outreg2
outreg2 [estlist] using filename [, word excel tex]
› [estlist] refers to the list of estimation results previously saved using the command: estimates store estname

Example:
  regress logwage educ
  estimates store est1
  regress logwage educ exper tenure
  estimates store est2
  regress logwage educ exper tenure female
  estimates store est3
  outreg2 [est1 est2 est3] using output, word
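
As a recap, the commands covered in this course can be chained in a single do-file. The following is only a minimal sketch under stated assumptions, not a definitive workflow: it uses the auto.dta example dataset installed with Stata instead of the wage data, and the log file name (training_session) and the new variable names (logprice, good_record, logprice_fitted, logprice_resid) are hypothetical. To run without any installation, it compares the stored results with the built-in estimates table command rather than outreg2.

```stata
* Recap sketch: file and variable names below are illustrative assumptions
clear all
set more off

* 1. Start a log file (text format, overwritten at each run)
capture log close
log using training_session, text replace

* 2. Load an example dataset installed with Stata and inspect it
sysuse auto, clear
describe
count

* 3. if / in qualifiers and the bysort prefix
list make price if price >= 5000 & price <= 7000
list make price in 1/5
bysort foreign: summarize price mpg

* 4. Transform data, guarding against missing values
generate logprice = ln(price)
label variable logprice "natural logarithm of price"
* rep78 contains missing values; without the if qualifier they would
* count as larger than 3 and be coded 1
generate good_record = (rep78 > 3) if !missing(rep78)
tabulate good_record, missing

* 5. Descriptive statistics
summarize price, detail
correlate price mpg weight

* 6. OLS regression with a 0/1 dummy, then post-estimation commands
regress logprice mpg weight foreign
estimates store est1
predict logprice_fitted
predict logprice_resid, residuals
estat vif
estat hettest
test mpg weight

* 7. Same model with a categorical regressor through the i. prefix
regress logprice mpg weight foreign i.rep78
estimates store est2

* 8. Side-by-side comparison of the stored results
estimates table est1 est2, se

log close
```

Saved as a .do file in the working directory, the sketch can be executed from the Do-file Editor or with the do command; once outreg2 is installed, step 8 could be replaced by an outreg2 call on est1 and est2.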