STATA I: An Introduction to the Basics
Prof. Dr. Herbert Brücker
University of Bamberg
Seminar “Migration and the Labour Market”
May 23, 2013
Contents
1 The STATA Software package
2 The Structure of STATA: Three files
3 Getting started
4 The STATA Menus
5 The General Structure of STATA
6 Working with DO FILES
7 Describe your data
8 Running regressions
1 STATA SOFTWARE PACKAGE
Image of STATA DVD in the Campus Net under:
“\\software\campliz”
or:
“\\software.uni-bamberg.de\campliz”
Then: Start -> Run ("Ausführen" in a German Windows)
Then: INSERT your licence:
Serial number: ….
Code: ….
Authorisation key: ….
2 Structure of STATA: Three files
1. The DATA file (.dta) where you have your data.
• You can view your data with the DATA BROWSER and edit your data with the DATA EDITOR
2. The DO file (.do) where you run and save the commands of any session. Very useful (i) to organise your data set, (ii) to see what you have done in the last session, (iii) to replicate what you have done in the last session, (iv) to exchange work with your collaborators.
• You write and run your commands with the DO FILE EDITOR
3. The LOG file (.log), which records the commands and output of your session once logging has been started with the log using command. Not often used, but useful if something goes wrong.
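A minimal sketch of how a log file is typically opened and closed. The file name session1.log is a hypothetical example, and the auto data set shipped with STATA stands in for your own data:

```stata
* Open a plain-text log in the current working directory;
* replace overwrites an existing file of the same name
log using "session1.log", text replace

* Everything typed from here on is recorded together with its output
sysuse auto, clear
summarize price

* Close the log at the end of the session
log close
```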
3 Getting started: the STATA empty window
The review window: reports your previous commands
The main window: shows commands, output and messages which arrive during your session
The variables window: shows the variables of your dataset
The command window: here you can type your commands
3 Getting started: the windows after data loading
[Screenshot: the review window reports the commands (one in this case), the main window reports the results of the commands, and the variables window shows the list of variables]
3 Getting started
• In principle, you can start your STATA session by (i) loading your data set and (ii) typing your commands in the command window.
• It is, however, recommended to use the DO FILE EDITOR right from the beginning.
• But let's look at the STATA menus first.
4 The STATA Menus
[Screenshot of the toolbar with the following buttons marked: the data path, the do file editor, the data editor, the data browser, the variables manager, the help menu]
• For viewing your data and changing your data by hand you need the DATA BROWSER and the DATA EDITOR.
• For starting and running your DO files you need the DO FILE EDITOR.
• The other menus are not relevant at the beginning.
4 The STATA Menus: The DATA EDITOR/BROWSER
The difference between the data browser and the data editor is that you can change the data in the editor, whereas you can only view them in the browser.
You have two types of variables: NUMERICAL variables (black)
and so-called STRING variables (blue) (e.g. text). STATA can
identify STRING variables, but you cannot do numerical
operations with them.
HINT: You can transfer data, e.g. from an EXCEL file into a STATA file, by copy and paste (Ctrl+C, Ctrl+V) in the data editor, and vice versa. But you have to make sure that your EXCEL runs in English, otherwise your data might be read as STRING variables by STATA (for example because of decimal commas). Of course there are many other ways to transfer data from Excel to STATA.
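One of these other ways is the import excel command (available from STATA 12). The sketch below assumes a hypothetical workbook mydata.xlsx in the current working directory with a hypothetical column wage; both names are placeholders, not part of the seminar data:

```stata
* Read the first sheet of the workbook;
* firstrow uses the first row as variable names
import excel using "mydata.xlsx", firstrow clear

* If a numeric column arrived as a string because of decimal commas,
* convert it to a numeric variable
destring wage, replace dpcomma

* Check the storage types of the imported variables
describe
```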
5 The Grammar of STATA
General Structure of STATA commands
[prefix :] command [varlist] [if] [in] [weight] [, options]
5 General structure of STATA
We will concentrate on:
[prefix :] command [varlist] [if] [in] [weight] [, options]
The command: what do you want to do?
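The following sketch shows how the individual elements of the syntax look in practice. It uses the auto example data set that ships with STATA (an assumption made only so that the lines can be run as they are):

```stata
sysuse auto, clear

* command + varlist
summarize price mpg

* command + varlist + if + options
summarize price mpg if foreign == 1, detail

* command + varlist + in (observations 1 to 5 only)
list make price in 1/5

* command + varlist + weight
summarize price [aweight = weight]

* prefix + command + varlist
bysort foreign: summarize price
```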
• There are two types of variables (data):
• numerical variables, e.g.: 0, 1, 501, 0.5, -12 etc.
• string variables, e.g.: "no voc train", "male", "female" etc.
• How to deal with the data types:
• Numerical variables: you can do all mathematical operations, e.g. var1 + var2, var1/var2, var1*var2 etc.
• String variables: you have to use straight double quotation marks for identification, e.g.
• gen var1 = 1 if sex == "female"
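A short sketch of both data types, again using the auto example data set shipped with STATA as a stand-in (the variable names price_per_lb and is_concord are made up for illustration):

```stata
sysuse auto, clear

* Numerical variables: arithmetic works directly
gen price_per_lb = price / weight

* String variables: compare against text in straight double quotes
gen is_concord = 0
replace is_concord = 1 if make == "AMC Concord"

* describe shows the storage type: str# for strings,
* byte/int/long/float/double for numerical variables
describe make price is_concord
```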
6 Working with DO FILES
• The standard approach is to start your work with a DO FILE
• Click on the DO FILE editor button after starting STATA
• Load an existing DO FILE or start a new one
• Start the DO FILE with a command to load your data, e.g.
• use "path\data.dta", clear
or, more specifically, with
• use "C:\Users\Herbert\Documents\STATA\Wagecurve\DE.dta", clear
Open your DO FILE editor
• After starting STATA, click on the DO FILE editor button
What does a DO FILE look like?
Descriptions of what you have done: lines starting with a star (*)
Commands
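A DO FILE along these lines might look as follows. This is only an illustrative sketch: the file name example.do is hypothetical and the auto example data set replaces the seminar data:

```stata
* ------------------------------------------------------------
* example.do: descriptive statistics for the auto data
* ------------------------------------------------------------

* Step 1: load the data
sysuse auto, clear

* Step 2: generate a new variable
gen ln_price = ln(price)      // in a DO FILE a comment can also follow //

* Step 3: describe the data
summarize price ln_price
```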
The DO FILE menu
Clicking this button runs
the entire DO FILE
(not recommended)
Clicking this button runs a
selection of marked
commands
(recommended)
Note: STATA stops the DO File execution after the first mistake in your
commands. That makes it advisable to proceed step by step.
6 Step 1: Loading your data
• use "C:\Users\Herbert\Documents\STATA\Wagecurve\DE.dta", clear
• The use command loads the data
• The "path\DE.dta" part tells STATA the path where to find the data and the name of the data file (e.g. DE.dta)
• The clear option after the comma clears the memory, which is needed if you have used other data sets before
• Push the “Execute Selection (DO)” button to run the selected command(s)
• The “Execute Selection Quietly (RUN)” button also runs the commands, but without showing their output in the main window
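The loading step can be tried without the seminar data. The sketch below first saves a copy of the auto example data set under the hypothetical name auto_copy.dta in the current working directory and then loads it again with use:

```stata
* Save a copy of the example data in the current working directory
sysuse auto, clear
save "auto_copy.dta", replace

* Load it again; clear discards whatever is currently in memory
use "auto_copy.dta", clear

* Quick check that the expected variables and observations arrived
describe, short
```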
Loading your data (I/II)
1. Write the command use "path\XXX.dta", clear
2. Mark the line and run the command by clicking the
execution button
Loading your data (II/II)
6 Step 2: Manipulating your data (I/VI)
• It is useful to save only a basic data set and to generate the variables you need at the beginning of each session. That saves storage space (recommended in case of large data sets)
• Generating DUMMY variables
• Use the gen command, e.g.
• gen D_ed1 = 0
• This creates a variable consisting only of zeros
• Then use the replace command, e.g.
• replace D_ed1 = 1 if ed1 == 1
• This replaces the zeros with 1 if the variable ed1 has a value of 1.
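The two steps can be sketched with the auto example data set, where rep78 (a repair record with some missing values) plays the role of ed1. The dummy names D_rep5 and D_rep5_b are made up for illustration:

```stata
sysuse auto, clear

* Two-step approach: zeros first, then replace
gen D_rep5 = 0
replace D_rep5 = 1 if rep78 == 5

* One-step alternative: a logical expression is 1 if true and 0 if false
gen D_rep5_b = (rep78 == 5)

* Caution: both versions code observations with missing rep78 as 0;
* to keep them missing instead:
replace D_rep5 = . if missing(rep78)

tabulate rep78 D_rep5, missing
```

Whether missing values should become 0 or stay missing depends on the analysis, so it is worth checking the result with tabulate.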
Generating Dummy Variables: DO FILE commands
Generating Dummy Variables: STATA main window
6 Step 2: Manipulating your data (II/VI)
• Another example for generating dummy variables:
• Use the gen command, e.g.
• gen year_1 = 0
• This creates a variable consisting only of zeros
• Then use the replace command, e.g.
• replace year_1 = 1 if year == 1991
• This replaces the zeros with 1 if the year variable has a value of 1991
• Note: STATA syntax requires a double equals sign (==) for comparisons after if. A single = is used only to assign a value.
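If one dummy per value is needed, the generate() option of tabulate is a possible shortcut. A sketch with the auto example data set, where rep78 stands in for the year variable and the prefix rep_ is an arbitrary choice:

```stata
sysuse auto, clear

* Creates rep_1, rep_2, ...: one 0/1 dummy for each value of rep78
tabulate rep78, generate(rep_)

* = assigns a value, == compares
gen high_rep = 0
replace high_rep = 1 if rep78 == 5
count if rep78 == 5
```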
6 Step 2: Manipulating your data (III/VI)
• Creating series of dummy variables if it is too cumbersome to create them individually, e.g. in case of interaction dummies
• Syntax:
• forvalues i = 1/3 {
      forvalues j = 1/4 {
          gen D_ed`i'_ex`j' = D_ed`i' * D_ex`j'
      }
  }
• i.e. for each value i = 1,2,3 and each value j = 1,2,3,4 you generate an interaction dummy (here named D_ed`i'_ex`j') by multiplying the dummy variables for education and experience. Take care of the {}: the opening brace ends the forvalues line and each closing brace stands on a line of its own.
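The nested loop can be tried on a small artificial data set. The sketch below assumes group variables ed (1 to 3) and ex (1 to 4) that are generated on the spot; the seminar data will have their own variable names. The whole block should be run in one go, because the loop counters `i' and `j' exist only while the loop is running:

```stata
clear
set obs 12
gen ed = mod(_n - 1, 3) + 1     // education group 1..3
gen ex = mod(_n - 1, 4) + 1     // experience group 1..4

* Base dummies
forvalues i = 1/3 {
    gen D_ed`i' = (ed == `i')
}
forvalues j = 1/4 {
    gen D_ex`j' = (ex == `j')
}

* Interaction dummies: 3 x 4 = 12 new variables
forvalues i = 1/3 {
    forvalues j = 1/4 {
        gen D_ed`i'_ex`j' = D_ed`i' * D_ex`j'
    }
}

describe D_ed*_ex*
```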
Generating Dummy Variables: Advanced techniques
6 Step 2: Manipulating your data (IV/VI)
• Transforming variables into log variables
• Syntax:
• gen ln_wijt = ln(wijt)
• By using the gen command again you can transform the wage variable wijt into the natural logarithm of the wage by applying the ln() function
Transforming data
6 Step 2: Manipulating your data (V/VI)
• Useful operators and functions in STATA:
• +       add
• -       subtract
• *       multiply
• /       divide
• ln()    transform into natural log
• exp()   transform into exponential value
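A brief sketch of these operators and functions applied to the auto example data set; all new variable names are made up for illustration:

```stata
sysuse auto, clear

gen sum_pw    = price + weight      // add
gen diff_pw   = price - weight      // subtract
gen prod_pw   = price * weight      // multiply
gen ratio_pw  = price / weight      // divide
gen ln_price  = ln(price)           // natural log; missing if price <= 0
gen price_chk = exp(ln_price)       // exponential; recovers price up to rounding

summarize price price_chk
```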
6 Step 2: Manipulating your data (VI/VI)
• Check what you have done
• Check your variables for mistakes in the browse mode of the data set
• You can delete a wrong variable by using the drop command, e.g.
• drop ln_wijt
• This simply drops the variable from the data set. Then you can create the correct one.
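The check-and-correct cycle might look like this in a DO FILE. The sketch uses the auto example data set and deliberately creates a wrong variable first:

```stata
sysuse auto, clear

gen ln_price = price                // mistake: the ln() was forgotten
list make price ln_price in 1/5     // inspect a few rows in the main window

drop ln_price                       // remove the wrong variable
gen ln_price = ln(price)            // create the correct one
list make price ln_price in 1/5
```

Typing browse instead of list opens the same variables in the DATA BROWSER.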
6 Step 3
• Organize your data with globals
• It is not convenient to work with too many variables, e.g. 200 dummy variables (it is cumbersome to type them by hand)
• You can define globals, which comprise many variables
• Syntax:
• glo [name of global] [list of variables]
• glo D_i D_ed1 D_ed2 D_ed3
• i.e. the global D_i consists of the variables D_ed1, D_ed2 and D_ed3
• If you want to use the global later you have to type
• $[globalname], e.g. $D_i
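A small sketch of defining and using a global, with the auto example data set and the hypothetical global name carvars:

```stata
sysuse auto, clear

* Define a global holding a list of variables
global carvars mpg weight length

* $carvars is expanded to "mpg weight length" before the command runs
summarize $carvars

* Show what the global currently contains
display "$carvars"
```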
Creating globals
7 Describe your data (I/II)
• Any econometric analysis requires as a first step that you provide descriptive statistics to the reader. This helps to understand what's going on
• This can be easily done with the sum command (short for summarize)
• sum [variable name(s)]
• sum LHijt LFijt wijt ln_wijt
• The sum command creates a table with the basic descriptive statistics, i.e. observations, mean, standard deviation, minimum, maximum
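A sketch of the sum command and two common variations, using the auto example data set in place of the wage data:

```stata
sysuse auto, clear

* Observations, mean, standard deviation, minimum, maximum
summarize price mpg weight

* detail adds percentiles, skewness and kurtosis
summarize price, detail

* The same basic statistics separately for each group
tabstat price mpg, statistics(n mean sd min max) by(foreign)
```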
Summary statistics
7 Describe your data (II/II)
• Present your data graphically
• It is usually helpful to present the main information/variables in your data set graphically
• There are many graphical commands; use the Graphics menu
• The simplest way is to show the development of your variable(s) over time
• Syntax:
• graph twoway line [variable1] [variable2] if …
• graph twoway line wqjt year if ed==1 & ex == 1
• This produces a two-dimensional graph with the wage on the vertical and the year on the horizontal axis for education group 1 and experience group 1
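The same kind of line graph can be tried with the uslifeexp example data set that ships with STATA (life expectancy in the US by year); it is used here only as a stand-in for the wage data:

```stata
sysuse uslifeexp, clear

* The last variable goes on the horizontal axis,
* the variable(s) before it on the vertical axis
graph twoway line le year if year >= 1950

* Several variables can share the vertical axis
graph twoway line le_male le_female year
```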
Making a graph
Graph of mean wage in education 1 and experience 1 group
Graph of migration rate in edu 1 and exp 1 group
8 Running regressions
• The standard OLS regression command in STATA is regress
• Syntax:
• regress depvar [list of indepvars] [if] [, options]
• regress ln_wijt m_ijt $D_i $D_j $D_t
• where $D_i, $D_j and $D_t are globals containing the dummy variables
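A runnable sketch of the regress command, with the auto example data set in place of the wage data and a hypothetical global named controls:

```stata
sysuse auto, clear

* Global with control variables
global controls weight length

* Dependent variable first, then the independent variables
regress price mpg $controls

* The same regression on a subsample
regress price mpg $controls if foreign == 0
```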
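8 Running Regressions
Recall: What is a linear regression model?
The general econometric model:
\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i \]
y_i: the dependent (or: endogenous) variable
x_{1i}, \dots, x_{ki}: exogenous variables, explaining the dependent variable
\beta_0: constant or the y-axis intercept (the value of y if all x = 0)
\beta_1, \beta_2, \dots, \beta_k: regression coefficients or parameters of the regression
\varepsilon_i: residual, disturbance term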
Running a regression model
[Screenshot of a regress command in the DO FILE, marking the regression command, the dependent variable, the independent variables and the globals]
Running a Regression: Output
How to interpret the output of a regression
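. reg ln_wqkt mqkt

      Source |       SS       df       MS              Number of obs =     800
-------------+------------------------------           F(  1,   798) =  212.53
       Model |  23.4146717     1  23.4146717           Prob > F      =  0.0000
    Residual |  87.9145738   798  .110168639           R-squared     =  0.2103
-------------+------------------------------           Adj R-squared =  0.2093
       Total |  111.329246   799  .139335727           Root MSE      =  .33192

------------------------------------------------------------------------------
     ln_wqkt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        mqkt |  -1.369118    .093913   -14.58   0.000    -1.553464   -1.184772
       _cons |   4.706176    .017403   270.42   0.000     4.672015    4.740337
------------------------------------------------------------------------------

Annotations:
• Upper left block (Source, SS, df, MS): variance of the model and degrees of freedom
• Upper right block, from top to bottom: 1. Observations, 2. fit of the model, 3. F-Test, 4. R-squared, 5. adjusted R-squared, 6. Root mean squared error
• Coef. column: β1 (row mqkt) and β0 (row _cons)
• P>|t| column: analysis of significance levels
• Last two columns: 95% confidence interval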
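Most numbers in this table can also be retrieved after the regression, which helps to see how they relate to each other. A sketch with the auto example data set (not the seminar data), using results that regress stores in e() and the coefficient accessors _b[] and _se[]:

```stata
sysuse auto, clear
regress price mpg

display "Observations:   " e(N)
display "F statistic:    " e(F)
display "R-squared:      " e(r2)
display "Adj. R-squared: " e(r2_a)
display "Root MSE:       " e(rmse)
display "beta_1 (mpg):   " _b[mpg]
display "beta_0 (_cons): " _b[_cons]

* t statistic computed by hand: coefficient divided by its standard error
display "t (mpg):        " _b[mpg] / _se[mpg]

* R-squared computed by hand: model sum of squares / total sum of squares
display "R-squared:      " e(mss) / (e(mss) + e(rss))
```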
8 Running Regressions: Panel Models
• Very often you use panel models, i.e. models which have a group and a time series dimension
• There exist special estimators for this, e.g. fixed or random effects models
• A fixed effects model is a model where you have a fixed (constant) effect for each individual/group. This is equivalent to a dummy variable for each group
• A random effects model is a model where you have a random effect for each individual/group, which is based on assumptions on the distribution of the individual effects
Preparation for Panel Models:
• For running panel models STATA needs to identify the group (individual) and the time series dimension
• Therefore you need an index for each group and an index for each time period
• Then use the tsset command to organize your dataset as a panel data set
• Syntax:
• tsset index year
• where index is the group/individual index and year the time index
Preparation: Running the tsset command
• Then you can use panel estimators, e.g. the xtreg estimator
• Syntax:
• xtreg depvar [list of indepvars] [if] [, options]
• xtreg ln_wijt m_ijt, fe
• i.e. in the example we run a simple fixed effects panel regression model, which is equivalent to including a dummy variable for each group (in this case each education-experience group)
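The panel steps can be tried on artificial data. The sketch below simulates a small panel (20 groups, 10 years); all variable names and the data-generating process are assumptions made for illustration only. It also shows that the fixed effects estimate of the slope coincides with an OLS regression that includes a dummy for each group:

```stata
clear
set seed 12345
set obs 200                               // 20 groups x 10 years
gen index = ceil(_n / 10)                 // group index 1..20
bysort index: gen year = 1990 + _n        // time index 1991..2000
gen m = runiform()                        // explanatory variable
gen u = rnormal()
bysort index (year): replace u = u[1]     // group effect, constant within group
gen y = 1 + 2*m + u + rnormal()

* Declare the panel structure: group index first, time index second
tsset index year

* Fixed effects and random effects estimators
xtreg y m, fe
xtreg y m, re

* Same slope as the fixed effects model: OLS with one dummy per group
regress y m i.index
```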
Running a Panel Regression: command
Running a Panel Regression: Output
Next Meeting:
June 13, 2013
Room RZ 01.02