Linear regression

GETTING
STARTED WITH
STATA
Sébastien Fontenay
ECON - IRES
THE
SOFTWARE
 Software developed in 1985 by StataCorp
 Functionalities
› Data management
› Statistical analysis
› Graphics
 Using Stata at UCL
› Computer labs
• Socrate 30, 31-32, 33, 34, 54 and 68
• Dupriez 143
• Leclercq 74, 76, 77 and 78
› Student licence to install on your personal computer
• valid during all your studies at the price of 20 euros
• www.uclouvain.be/438229.html
FINDING
SUPPORT (1)
 Best documentation
› help command
› search keyword
 Stata website : www.stata.com/support
› Frequently Asked Questions
› Video tutorials
› Statalist
 Books
› Cahuzac, E., Bontemps, C. (2008). Stata par la pratique: Statistiques,
graphiques et éléments de programmation.
› Cameron, A.C., Trivedi, P.K. (2009). Microeconometrics using Stata.
› Becketti, S. (2013). Time series using Stata.
 UCLA : www.ats.ucla.edu/stat/stata
FINDING
SUPPORT (2)
 For all your questions related to data management or analysis using
Stata
› Website: http://www.uclouvain.be/411370
› Email: sebastien.fontenay@uclouvain.be
› By appointment only:
• Bâtiment Dupriez (office d010), 3 place Montesquieu
COURSE
TOPICS
Quick tour of Stata
• Working environment
• Writing commands
Data management
• Inputting data
• Transforming data
Data analysis
• Descriptive statistics
• Linear regression
• Exporting results
SECTION 1
QUICK TOUR OF STATA
• Working environment
• Writing commands
WORKING
ENVIRONMENT
 The working environment is composed of 5 windows
› Results
• of commands
› Variables
• list and labels
› Review
• of commands
› Properties
• of variables and dataset
› Command
• window
WORKING
ENVIRONMENT
 Three specific windows can be opened by clicking on the following icons
› Data editor/browser
• Display data in memory
› Viewer
• Display log and help files
› Do-file Editor
• Text editor to save/execute commands
 There are 3 main types of files used in Stata
› .dta  data
› .do  commands (do-file)
› .smcl | .log  output (log file)
WORKING
ENVIRONMENT
 All software functionalities are available from the dropdown menus (Data, Graphics, Statistics)
› Useful when you are unsure of commands to run or unfamiliar with available options
 Every command issued in this manner is echoed to the review and results windows
› e.g. sysuse auto.dta
WORKING
ENVIRONMENT
 In order to use Stata effectively, you should always follow this
three-step process:
› Open a do-file
› Choose your working directory
• cd "C:\Users\Me"
• mkdir stata_training
• cd stata_training
- You can see the current working directory at the bottom left of the main window
› Start a log file (saving commands and their output)
• log using filename [, text append replace]
- log close: stop logging and close the log file
- log off | on: temporarily suspend | resume logging
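Put together, the three steps might look like this at the top of a do-file. This is only a sketch: the folder C:\Users\Me and the name stata_training come from the slide above, while the log name session1 is a placeholder chosen for illustration.

```stata
* --- top of a do-file (names are placeholders) ---
capture log close                 // close a log left open by an earlier run, if any
cd "C:\Users\Me"                  // path taken from the slide; adapt to your machine
capture mkdir stata_training      // capture: no error if the folder already exists
cd stata_training
log using session1, text replace  // session1 is a hypothetical file name

sysuse auto, clear                // commands and output from here on are logged
describe

log close
```

Starting with capture log close is a common habit: it lets you rerun the whole do-file without an error about a log that is already open.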
WRITING
COMMANDS
 Stata commands use a common syntax:
[prefix :] command [varlist] [= exp] [if] [in] [, options]
• The square brackets denote qualifiers that are optional
• Italicized words are to be substituted by the user
› varlist denotes a list of variables
› exp is a mathematical expression
 Stata is case sensitive! (i.e. UPPERCASE != lowercase)
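To see how the pieces of the syntax fit together, here is a short sketch using the auto example dataset shipped with Stata. The variable name price_k is made up for the illustration.

```stata
sysuse auto, clear

* command varlist [if] [in] [, options]
summarize price mpg if foreign == 1 in 1/74, detail

* prefix: command
bysort foreign: summarize price

* command newvar = exp
generate price_k = price / 1000

* case sensitivity: Price is not the same name as price
capture confirm variable Price
display _rc        // non-zero return code: there is no variable called Price
```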
WRITING
COMMANDS
 Operators may be used to manipulate numerical or string variables
Arithmetic             Logical       Relational
+   addition           &   and       >    greater than
-   subtraction        |   or        <    less than
*   multiplication     !   not       >=   greater than or equal
/   division           ~   not       <=   less than or equal
^   raised to power                  ==   equal
                                     ~=   not equal
                                     !=   not equal
 Pay attention that a double equal sign (==) is used for equality
testing
WRITING
COMMANDS
 Logical and relational operators are particularly useful with if
qualifiers to define the sample for analysis
 The if qualifier at the end of a command means the command is to
use only the data specified
› command if exp
• list make if foreign==1
• list if make=="Volvo 260"
• list make price if price>=5000 & price<=7000
• list make price if price<5000 | price>7000
 Note that character strings are enclosed in double quotes
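The conditions above can be tried on the auto example dataset. The sketch below uses count instead of list so that the effect of each condition shows up as a single number; inrange() is a built-in function that expresses the same interval test as the & condition.

```stata
sysuse auto, clear

count if foreign == 1                       // number of foreign cars
count if price >= 5000 & price <= 7000      // both conditions must hold
count if inrange(price, 5000, 7000)         // same interval, written with a function
count if price < 5000 | price > 7000        // at least one condition must hold

* string comparison: the value is enclosed in double quotes
list make price if make == "Volvo 260"
```

Because price has no missing values in this dataset, the second and fourth counts add up to the total number of observations.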
WRITING
COMMANDS
 You can refer to a list of numbers using the following shorthand
1/30    1 to 30
1/l     1 until last number
f/-5    first to 5th number before the end
-5/l    last five numbers
 Numlists are particularly useful with the in qualifier to specify a
range of observations to be used
› command in range
• list in f/10
• list in -10/l
• list make price in 74
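A quick way to check what a range selects is to run it on the auto example dataset (74 observations). This is a small sketch; the choice of variables is arbitrary.

```stata
sysuse auto, clear

list make price in f/5      // first five observations
list make price in -5/l     // last five observations
list make price in 74       // a single observation
count in 1/30               // counts the observations in the range
```

Note that in refers to the current order of the data, so the same range can select different observations after a sort.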
WRITING
COMMANDS
 The by prefix repeats execution of a command on subsets of the data
› subsets are groups of observations that take the same value in a given
variable (often a categorical variable)
• by varname: command
- by foreign: list make
› If the dataset is not sorted, you should use the bysort prefix instead
• bysort varname: command
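The difference between by and bysort can be seen by deliberately breaking the sort order first. A minimal sketch on the auto example dataset; group_size is a name invented for the illustration.

```stata
sysuse auto, clear
sort price                                   // data are no longer sorted by foreign

capture noisily by foreign: summarize price  // fails with a "not sorted" error
bysort foreign: summarize price              // sorts by foreign, then runs the command

* _N inside a by-group is the number of observations in that group
bysort foreign: generate group_size = _N
```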
SECTION 2
DATA MANAGEMENT
• Inputting data
• Transforming data
INPUTTING
DATA
 To open a dataset in Stata format (.dta): use
› use filename [, clear]
• sysuse - open example datasets installed with Stata
 To save a dataset in Stata format: save
› save filename [, replace]
 Stata can also import/export Excel files (.xls or .xlsx)
› import excel filename [, firstrow]
› export excel filename [, firstrow(variables)]
 By default, Stata opens/saves a dataset from/in the current working
directory but you can specify
› another directory: use | save "C:\Users\Me\Stata_training\dataset.dta"
› a web address: use http://sites.uclouvain.be/datasupport/data/wage.dta
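A round trip through these commands could look as follows. This is a sketch only: the file names auto_copy and auto_copy.xlsx are placeholders, and the files are written to the current working directory.

```stata
sysuse auto, clear

* save and reopen in Stata format
save auto_copy, replace
use auto_copy, clear

* export to Excel with variable names in the first row, then read it back
export excel using "auto_copy.xlsx", firstrow(variables) replace
import excel using "auto_copy.xlsx", firstrow clear
describe
```

The clear option is needed whenever data in memory would otherwise be lost, and replace whenever the target file already exists.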
INPUTTING
DATA
 Summary of the dataset
› describe: information on dataset in memory
› codebook: detailed description of variables
 Further explore data in memory
› count: number of observations
› list: display data in the results window
 Manipulate variables/observations
› keep wage educ exper
› drop in 1/10
› sort wage
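The same exploration steps can be tried on the auto example dataset, used here as a stand-in for the wage data; the variables kept are an arbitrary choice.

```stata
sysuse auto, clear

describe                      // variable names, storage types, labels
codebook price foreign        // detailed description of two variables
count                         // number of observations

keep make price mpg foreign   // keep these variables, drop all others
drop in 1/10                  // drop the first ten observations
sort price                    // ascending order of price
list in 1/5                   // the five cheapest remaining cars
```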
TRANSFORMING
DATA
 To create a new variable: generate
› generate newvar = exp [if] [in]
• exp may be a number, a character string or a mathematical function
• generate constant = 1
- Create a constant equal to 1
• generate constant_text = "text"
- Create a constant that contains the character string "text"
• generate logwage = ln(wage)
- Create a variable equal to the natural logarithm of wage
• generate expersq = exper^2
- Create a variable equal to the square of exper
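These generate commands can be tested on a tiny dataset typed in by hand. The three rows below are made-up values for illustration, not the course data.

```stata
clear
input wage exper
3.10  2
3.24 22
6.00 44
end

generate constant = 1
generate constant_text = "text"
generate logwage = ln(wage)
generate expersq = exper^2
list
```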
TRANSFORMING
DATA
 To create specific variables using time series operators
› generate lag_gdp = L.gdp
• Create a variable corresponding to the first lag of gdp
› generate lead_gdp = F.gdp
• Create a variable corresponding to the first lead of gdp
› generate diff_gdp = D.gdp
• Create a variable corresponding to the first difference of gdp
 Before using these operators, you must tell Stata that you are working
with time series data using the command: tsset
› tsset time [, yearly monthly quarterly daily]
 Using system variables
› generate gdp_growth = ((gdp[_n] - gdp[_n-1]) / gdp[_n-1])*100
• Create a variable equal to the growth rate of gdp
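A small worked example may help compare the two ways of computing a growth rate. The yearly gdp figures below are invented for the illustration, and gdp_growth_ts is a hypothetical variable name.

```stata
clear
input year gdp
2010 100
2011 104
2012 102
2013 107
end

tsset year, yearly                 // declare the time variable first

generate lag_gdp  = L.gdp          // value in the previous year
generate lead_gdp = F.gdp          // value in the next year
generate diff_gdp = D.gdp          // gdp minus previous year's gdp

* growth rate with explicit subscripts (_n is the current observation number)
generate gdp_growth = ((gdp[_n] - gdp[_n-1]) / gdp[_n-1]) * 100

* same growth rate with time series operators
generate gdp_growth_ts = (D.gdp / L.gdp) * 100
list
```

The two growth variables agree here because the data are sorted by year with no gaps. The subscript version relies on the physical order of the rows, whereas the operators follow the time variable declared with tsset.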
TRANSFORMING
DATA
 To modify an existing variable: replace
› replace wage=20 if wage>=20
 To rename an existing variable: rename
› rename wage hourly_wage
 You can also add a brief description to the variable using labels
› label variable educ "total years of education"
TRANSFORMING
DATA
 When transforming data, one must be careful with missing values
› Missing values in Stata are coded with a . (period)
 Stata treats missing values as large numbers, higher than any other
values of a given variable
› In certain cases you should use the if qualifier to exclude missing values
• generate rich = (wage>15) if wage<.
|or|
• generate rich = (wage>15) if wage!=.
|or|
• generate rich = (wage>15) if !missing(wage)
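The pitfall is easiest to see on a few hand-typed rows. The values below are made up, and rich_wrong and wage_top are names invented for the illustration.

```stata
clear
input wage
10
18
25
.
end

* without the qualifier, the missing wage is treated as larger than 15
generate rich_wrong = (wage > 15)

* with the qualifier, rich stays missing where wage is missing
generate rich = (wage > 15) if !missing(wage)

* the same care applies to replace: top-code at 20 without touching missing values
generate wage_top = wage
replace wage_top = 20 if wage_top >= 20 & !missing(wage_top)
list
```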
SECTION 3
DATA ANALYSIS
• Descriptive statistics
• Linear regression
• Exporting results
DESCRIPTIVE
STATISTICS
 Categorical variables
› One-way table of frequencies
• tabulate female
- The option [, missing] displays the total frequency of missing observations
› Two-way table of frequencies
• tabulate female married
 Continuous variables
› summarize gives the number of observations, the mean, the standard
deviation, the minimum and maximum values
• summarize wage educ
- The option [, detail] displays the main percentiles, the four smallest and four largest values, the variance, as well as the skewness and kurtosis measures
 Pearson’s correlation coefficient
› correlate varlist [, covariance]
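These commands can be tried on the auto example dataset, used here as a stand-in for the wage data: foreign and rep78 play the role of the categorical variables, price and mpg that of the continuous ones.

```stata
sysuse auto, clear

tabulate rep78, missing          // one-way table, including missing values
tabulate foreign rep78           // two-way table

summarize price mpg              // N, mean, std. dev., min, max
summarize price, detail          // adds percentiles, variance, skewness, kurtosis

correlate price mpg weight       // correlation matrix
correlate price mpg weight, covariance
```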
DESCRIPTIVE
STATISTICS
 Exploring data with graphs
› Distribution of a continuous variable: histogram
• histogram wage
- the option [, normal] draws a normal density line on the plot
› Scatter plot between two variables: scatter
• scatter wage educ
› Evolution of time series: tsline
• tsline gdp
- available only after tsset
LINEAR
REGRESSION
 We seek to estimate the relationship between one dependent variable
and a set of independent variables
› using the Ordinary Least Squares (OLS) estimator
 Classical linear model assumptions (Wooldridge, 2008):
› Model is linear in parameters
› Data are a random sample of the population
› No perfect collinearity between independent variables
› Zero conditional mean of error term
› Homoskedasticity
› Normality of the residuals
LINEAR
REGRESSION
 The model we want to estimate:
› $\log(wage) = \beta_0 + \beta_1\,education + \beta_2\,experience + \beta_3\,tenure + u$
• where:
- wage is average hourly earnings in dollars
- education is the number of years of education
- experience is the number of years of labour market experience
- tenure is the number of years with the current employer
 In Stata:
› regress logwage educ exper tenure
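To try the command without the full dataset, the sketch below types in the ten observations listed in the table of fitted values further down in these slides. With only ten observations the estimates will not match those obtained on the full sample; the point is only to show the mechanics.

```stata
clear
input logwage educ exper tenure
1.131402 11  2  0
1.175573 12 22  2
1.098612 11  2  0
1.791759  8 44 28
1.667707 12  7  2
2.169054 16  9  8
2.420368 18 15  7
1.609438 12  5  3
1.280934 12 26  4
2.900322 17 22 21
end

regress logwage educ exper tenure

* coefficients are available after estimation as _b[varname]
display "Estimated return to one more year of education: " _b[educ]
```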
LINEAR
REGRESSION
 Stata output
LINEAR
REGRESSION
 Analysis of variance
› Sum of Squares (SS)
• Explained variance (model)
• Residual variance
• Total variance
› Degrees of freedom (df)
› Mean Squares (MS)
• SS divided by df
LINEAR
REGRESSION
 Overall model fit
› Number of observations
› F-statistic
› p-value associated with the F-statistic
• testing the null hypothesis that all of the model
coefficients are 0
› R-squared
• proportion of variance in the dependent variable explained by the independent variables
- SS(model) divided by SS(total)
› Adjusted R-squared
› Standard deviation of the error term (Root MSE)
• $\sqrt{MS(residual)}$
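The fit statistics in the header of the output can be recomputed by hand from the analysis of variance table, using the results that regress stores in e(). The sketch below runs on the auto example dataset as a stand-in; the regression itself is arbitrary.

```stata
sysuse auto, clear
regress price mpg weight

* e(mss) = SS(model), e(rss) = SS(residual), e(df_m) and e(df_r) = their df
display "R-squared     = " e(mss) / (e(mss) + e(rss))
display "Adj R-squared = " 1 - (e(rss)/e(df_r)) / ((e(mss) + e(rss))/(e(N) - 1))
display "F-statistic   = " (e(mss)/e(df_m)) / (e(rss)/e(df_r))
display "Prob > F      = " Ftail(e(df_m), e(df_r), e(F))
display "Root MSE      = " sqrt(e(rss)/e(df_r))
```

Each displayed value should match the corresponding entry in the regression header.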
LINEAR
REGRESSION
 Parameter estimates
› Dependent variable (1)
› Independent variables and intercept (2)
› Coefficients (3)
› Standard errors (4)
› t-statistics (5)
› p-values associated with the t-statistics (6)
• testing the null hypothesis that a given coefficient is 0
› 95% confidence intervals (7)
LINEAR
REGRESSION
 Predicting fitted values and residuals
› predict wage_fitted
• e.g. 1.304921 = 0.2843595 + 11*0.092029 + 2*0.0041211 + 0*0.0220672
› predict wage_resid, r
• e.g. -0.1735185 = 1.131402 - 1.304921

obs   logwage    educ   exper   tenure   wage_fitted   wage_resid
  1   1.131402     11       2        0      1.304921   -0.1735185
  2   1.175573     12      22        2      1.523506   -0.3479329
  3   1.098612     11       2        0      1.304921   -0.2063083
  4   1.791759      8      44       28      1.819802   -0.0280429
  5   1.667707     12       7        2      1.461690    0.2060172
  6   2.169054     16       9        8      1.970451    0.1986027
  7   2.420368     18      15        7      2.157168    0.2631997
  8   1.609438     12       5        3      1.475515    0.1339233
  9   1.280934     12      26        4      1.584125   -0.3031912
 10   2.900322     17      22       21      2.402928    0.4973939
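The hand calculation above can be reproduced in Stata from the stored coefficients. The sketch below uses the auto example dataset as a stand-in for the wage data; the names price_fitted, price_resid, fitted_by_hand and resid_by_hand are made up for the illustration.

```stata
sysuse auto, clear
regress price mpg weight

predict price_fitted                 // default statistic: linear prediction (xb)
predict price_resid, residuals       // observed value minus fitted value

* same quantities rebuilt from the coefficients
generate fitted_by_hand = _b[_cons] + _b[mpg]*mpg + _b[weight]*weight
generate resid_by_hand  = price - fitted_by_hand

list make price price_fitted fitted_by_hand price_resid resid_by_hand in 1/5
```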
LINEAR
REGRESSION
 Incorporating categorical information into regression models
 Dummy variables (coded as 0/1) can be included as such in the
regression
› regress wage educ exper tenure female
 Categorical variables with more than two categories must be included
using the i. prefix
› regress wage educ exper tenure i.region
• Stata will automatically create dummy variables for each category and
incorporate them in the regression except the reference category
- You can use the prefix ib#. instead to change the reference category (e.g. ib3.region makes the category coded 3 the reference)
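As an illustration, the auto example dataset contains a 0/1 dummy (foreign) and a categorical variable with five levels (rep78), which can stand in for female and region.

```stata
sysuse auto, clear

* a 0/1 dummy enters the regression as is
regress price mpg weight foreign

* i. creates one dummy per category, omitting the lowest value (here 1) as reference
regress price mpg weight i.rep78

* ib3. makes the category coded 3 the reference instead
regress price mpg weight ib3.rep78
```

The coefficient on each category is interpreted relative to the omitted reference category, so changing the reference changes the coefficients but not the overall fit.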
LINEAR
REGRESSION
 Post-estimation tests
› Multicollinearity (Wooldridge, 2008 - chapter 3, p. 99)
• estat vif
- Rule of thumb: a variance inflation factor above 10 signals a multicollinearity problem
› Normality of the residuals
• sktest varname
- Skewness and kurtosis test, testing the null hypothesis that the variable follows a normal distribution
• swilk | sfrancia varname
- Shapiro-Wilk and Shapiro-Francia tests, testing the same null hypothesis of normality
› Homoskedasticity (Wooldridge, 2008 - chapter 8)
• estat hettest
- Breusch-Pagan test, testing the null hypothesis of homoskedasticity
• estat imtest, white
- White test, testing the null hypothesis of homoskedasticity
• The [, robust] option after regress gives heteroskedasticity-robust standard errors
› F-test: testing that a group of variables has no effect on the dependent
variable – joint hypotheses test (Wooldridge, 2008 - chapter 4, p143)
• test var1 var2
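A possible sequence of these tests is sketched below on the auto example dataset, used as a stand-in for the wage data. The residuals must be saved with predict before the normality tests can be applied to them; uhat is a name chosen for the illustration.

```stata
sysuse auto, clear
regress price mpg weight foreign

estat vif                       // variance inflation factors

predict uhat, residuals         // save the residuals in a new variable
sktest uhat                     // skewness and kurtosis test of normality
swilk uhat                      // Shapiro-Wilk test of normality

estat hettest                   // Breusch-Pagan test
estat imtest, white             // White test

* re-estimate with heteroskedasticity-robust standard errors
regress price mpg weight foreign, robust

* joint test that mpg and weight both have a zero coefficient
test mpg weight
```

The heteroskedasticity tests are run before the robust regression because they apply to the model estimated with the default standard errors.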
EXPORTING
RESULTS
 outreg2 makes it easy to export the results of one or several
regressions
› outreg2 is a user-written command: install it once with ssc install outreg2
› to Microsoft Office applications: Word, Excel
› to LaTeX
 outreg2 [estlist] using filename [, word excel tex]
› [estlist] refers to the list of estimation results previously saved using the
command: estimates store estname
EXPORTING
RESULTS
regress logwage educ
estimates store est1
regress logwage educ exper tenure
estimates store est2
regress logwage educ exper tenure female
estimates store est3
outreg2 [est1 est2 est3] using output, word