Automatization in Stata—application to propensity

advertisement
Automatisation in Stata
Jan Hagemejer
& Joanna Tyrowicz
Plan
1.
Standard solutions
2.
Where they do not work?

Usually more than one way to estimate – how to chose?

Using loops and global function together

Generating the resultssets for atypical estimations.

Difficulties with using bootstrap (and obtaining resultssets)
3.
Summary comments … and some advices
2
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
The standard route
Problem: several estimations of similar form.
Need to compare results.


Three simple solutions:
 Solution 1: brute force = sit & type
 Solution 2: use parmby/parmest: if estimations on simple categories
in data (limitations of „by” command)
 Solution 3: use loops
 See N. Cox’s material from previous SUGM)
 Commands developed by Roger Newson: outreg/outreg2
 nicely formatted tables,
 publication-ready,
 in many formats, even LaTeX.
 Note: if you need nice summary statistics, you can use outsum
either with by or within loops

3
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Where the problems come from?



2nd and 3rd solution works only with regression-type estimations
However, some procedures are incompatible with pre-cooked solutions
Examples:
 Marginal effects,
 Use outreg2 in Stata10 if use dprobit/logit instead of
probit/logit
 Use outreg2 in Stata11 with margins and/or mfx2 (remeber
about replace option)

Nice statistics
 Use tempname and postfile syntax

Rolling window on any of this type of analysis
4
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Not everything may be solved this way…

Reason 1: things more complex than they seem (to come in a sec..)

Reason 2: some things are not listed in the output:
 Example: various versions of R2 or sample size in simple regressions
 outreg/parmest typically do not include them
 they can be included as additional locals
 you need to know what locals they are => solution: the family of
„return list” commands
 ret li => results stored in r(), general commands
 eret li => results stored in e(), estimation commands
 sret li => results stored in s(), programming commands

5
Practical example
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Cookbook for „simple” problems
Run procedure
Check with the use of „return list” family, which statistics you need
Add locals that should be generated after the procedure
Add these statistics to outreg2/parmest commands
1.
2.
3.
4.
forvalues no=1(1)10 {
xi: xtreg x y z i.year i.month if g`no'==1, fe robust
local Between=e(r2_b)
local Within=e(r2_w)
local No_min=e(g_min)
local No_max=e(g_max)
outreg2 using file.xls, bdec(4) title(Title) ctitle(`no')
append excel addstat(R2 between, `Between', R2 within,
`Within', No min, `No_min', No max, `No_max', No average,
`No_avg')
}
6
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Our problem is different – application to PSM

Need to report:
 output of the procedure
 sample properties after matching
 balancing properties of matching

Problem1: actually, none of these is in the typical output

Problem2: we need it for many estimations looped over many variables
and each one of them takes a looooong time
7
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Detailed problem description

Analyse the effects of privatisation

Observe what happens before and after the „event” of privatisation, but
time runs:


Effects may be observed in many spheres:


E.g. only better firms are privatised, so difference in performance is not
due to the privatisation
Effects may be largerly due to self-selection


E.g. profits, investments, international competitiveness, employment
Effects may be due to self-selection


E.g. firm A may be one year before privatisation in 1999 and firm B in
2006, so „event” is an anchor and time „runs” both ways.
Heckman correction will tell about the statistical significance but not
about the economic relevance
Propensity score matching is the best solution
8
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Detailed problem desciption
Run logistic regression:
1.

Dependent variable: Y = 1, if participate; Y = 0, otherwise.
2.
Choose appropriate conditioning (instrumental) variables.
3.
Obtain propensity score: predicted probability (p) or log[p/(1 − p)].
4.
Match each participant to one or more nonparticipants on propensity score:

Choose an adequate metric
Compare outcome variables
5.

Example: test means equality in sample treated and control group
6.
In PSM: obtaining pscore is irrelevant, but matching is key
7.
To verify if matching is ok, need to run some diagnostics

9
Example: compare the balancing properties after matching (so-called bias
reduction thanks to matching)
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Detailed problem description

Thus, in our case:






Many time periods (for each „time-to-anchor” a separate estimation)
Many variables (for each variable separate outcomes, but within one
period the same balancing properties)
Two ways of estimating: regular and bootstrapping (especially the latter
made things complex)
Each estimation: roughly 1.5-3.5 hours
Over a hundred estimations
Additional pitfalls:
10

We needed some statistics for all estimations and they were not in the
return list

More precisely: procedure computes them to be able to produce output,
but they were not added to the return list by authors
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Summary of the problems
Our problem was quite specific… BUT consisted of many general problems:
1.
Loops take a lot of time – need to find efficient ways
2.
Some things cannot be obtained fast => even more reasons to run it
automatically
3.
Obtaining datasets of the variables we need (so-called resultssets)

Getting visible data if they are not an output

Using invisible data
4.
Getting around with bootstrap
11
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
The structure of our estimations
Specific loops
• Balancing properties
• Before and after matching
statistics
Loop for variables (30 variables)
• Run standard estimations
• Run bootstrap estimation
Loop for time (12 periods)
12
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Using pscore or psmatch?
Using pscore or psmatch?

Typical psmatch syntax:
psmatch2 treat treatment_determinants, out(outcomes)
options

Alternative
 Estimate pscore first:
pscore treatment treatment_determinants, pscore(name)
Run:
psmatch2 treatment pscore, out(outcomes) options


How to choose?
 If you want to bootstrap, pscore estimated once will save you time
 If you want to introduce data-fitted caliper into options, pscore first is
a must
14
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
How global function can be usefull?
Using the global function for estimations




Our application: observe the same firms back and forth from the moment of
the privtisation („event”)
 „Events” happen in different years
 But we can only match on one dimension: has or has not the „event”
Conceptual solution: use lags and forwards to get the time dimension
Technical problem: many outcomes variables and de facto many loops
Technical solution: define separately matching variables and output variables
global in="cut* remoteness eksporter energia obrot klratio roa ros
indebtedness wsk_plynnosci net_income_efficiency klratio_new roa_new
indebtedness_new indebtedness_new wsk_plynnosci_new"
global out="te_new redukcja wzrost_zatr share_export lewar s_eff"
global outf1="ff1_te_new ff2_te_new ff3_te_new ff4_te_new ff5_te_new
ff1_redukcja ff2_redukcja ff3_redukcja ff4_redukcja ff5_redukcja
ff1_wzrost_zatr ff2_wzrost_zatr ff3_wzrost_zatr ff4_wzrost_zatr
ff5_wzrost_zatr"
global outf2="ff1_share_export ff2_share_export ff3_share_export
ff4_share_export ff5_share_export ff1_lewar ff2_lewar ff3_lewar ff4_lewar
ff5_lewar ff1_s_eff ff2_s_eff ff3_s_eff ff4_s_eff ff5_s_eff"
16
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
The begining of the estimations – so far
forvalues d=6(1)18 {
use data, clear
capture log close
capture drop our_pscore* caliper* mean* diff* ttest* se_after* se_before*
treated nontreated
log using priv_caliper_`d', text replace
pscore d`d' $in, pscore(our_pscore_`d')
ttest our_pscore_`d', by(d`d') unequal
capture drop sd_nontreated sd_treated
gen sd_nontreated=`r(sd_1)'
gen sd_treated=`r(sd_2)'
gen caliper_`d'= ((sd_treated^2+sd_nontreated^2)/2)^0.5
sum caliper_`d'
local c_real=`r(mean)'
hist nasz_pscore_`d', by(d`d')
graph save „our_pscore_d`d'.png", replace
psmatch2 d`d' our_pscore_`d', out($out $outf1 $outf2) common add
mahalanobis(nace) caliper(`c_real')
17
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Getting from results to „resultssets”
18
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd,
2010
Why (and what) do we need (in) the resultssets?

Why?
 Most importantly: without resultssets we cannot
 analyse the changes over time
 decompose the observed differentials
 If we do not do it automatically, it would have to be copied manually
from logs – many estimations, many variables, etc

What ? Step 1: find out the reality
1. Size of each of the three groups: treated, total and control (= matched)
2. Averages in all three groups (medians, etc.)
3. Knowledge if in fact they are different (= test of the statistical
significance based on difference and standard error of this difference)

What? Step 2: find out, how good the findings are statistically
1. Balancing properties!
19
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Our solution to step 1
foreach out in $out $outf1 $outf2 {
local se_after=r(seatt_`out')
gen se_after_`out'=`se_after'
local diff_after=r(att_`out')
gen diff_after_`out'=`diff_after'
sum `out' if d`d'==0 & _support==1
local mean_nontreated=r(mean)
gen mean_nontreated_`out'=`mean_nontreated'
sum `out' if d`d'==1 & _support==1
local mean_treated=r(mean)
gen mean_treated_`out'=`mean_treated'
ttest `out' if _support==1, by(d`d') unequal
local se_before=r(se)
gen se_before_`out'=`se_before'
local mean_before=r(mu_2)-r(mu_1)
gen diff_before_`out'=`mean_before'
gen ttest_before_`out'=diff_before_`out'/se_before_`out'
gen ttest_after_`out'=diff_after_`out'/se_after_`out‘
CONTINUED ON THE NEXT SLIDE
20
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Our solution to step 1 - continued
foreach type in before after {
label var se_`type'_`out' "Standard error of difference `type' matching"
label var diff_`type'_`out' "Difference `type' matching"
label var ttest_`type'_`out' "T-test of difference"
}
label var mean_treated_`out' "Mean of treated companies"
label var mean_nontreated_`out' "Mean of non-treated companies (before matching)"
}
count if d`d'==1 & _support==1
local treated=r(N)
gen treated=`treated'
label var treated "No of treated companies"
count if d`d'==0 & _support==1
local nontreated=r(N)
gen nontreated=`nontreated'
label var nontreated "No of control companies"
21
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Our solution to step 2
pstest $in
foreach in in $in {
capture local bias_reduction=r(bired_`in')
capture local pvalue_bef=r(pbef_`in')
capture local pvalue_after=r(paft_`in')
capture gen b_red_`in'=`bias_reduction'
capture gen pval_ber_`in'=`pvalue_bef'
capture gen pval_aft_`in'=`pvalue_after'
}
outsheet b_red* pval* using stats_priv_`d', replace
psgraph
graph save priv_support_`d', replace
graph export priv_support`d'.png, replace
drop b_red* pval*
22
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
„Missing statistics”
Solving problem of „missing” statistics

Look into the „ado” file you are using (procedure)
 Throughout the file, there are commands
return scalar x=`somelocal’



Sometimes – for clarity – scalars are dropped at the end of
procedure
Your prefered statistic (if it is in the output, it has to be at least a
local) would simply have to have a local like that too
If it does not – you can always generate it based on your
preferences and available locals
=> Modify the original ado file
24
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Solving problem of „missing” statistics – example 1
Original ado file – line 380
qui foreach v of varlist `varlist' {
replace _`v' = . if _support==0
tempname m1t m0t u0u u1u att dif0
sum `v' if _treated==1, mean
scalar `u1u' = r(mean)
sum `v' if _treated==0, mean
scalar `u0u' = r(mean)
sum `v' if _treated==1 & _support==1, mean
scalar `m1t' = r(mean)
local n1 = r(N)
sum _`v' if _treated==1 & _support==1, mean
scalar `m0t' = r(mean)
scalar `att' = `m1t' - `m0t'
scalar `dif0' = `u1u' - `u0u‘
return scalar att = `att'
return scalar att_`v' = `att'
25
Modified ado file – line 380
qui foreach v of varlist `varlist' {
replace _`v' = . if _support==0
tempname m1t m0t u0u u1u att dif0
…
/all the same as earlier plus /
return scalar diff = `dif0'
return scalar diff_`v' = `dif0‘
return scalar mean0 = `u0u'
return scalar mean0_`v' = `u0u‘
return scalar mean1 = `u1u'
return scalar mean1_`v' = `u1u'
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Solving problem of „missing” statistics – example 2
Original ado file – line 440
Modified ado file – line 440
return scalar seatt = `stderr'
return scalar seatt = `stderr'
return scalar seatt_`v' = `stderr'
return scalar seatt_`v' = `stderr'
qui regress `v' _treated
qui regress `v' _treated
scalar `ols' = _b[_treated]
scalar `ols' = _b[_treated]
scalar `seols' = _se[_treated]
scalar `seols' = _se[_treated]
return scalar seols = `seols‘
return scalar seols_`v' = `seols'
26
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Problems with bootstrap
27
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd,
2010
Problems with bootstrap

Why did we need bootstrap?
 After estimations s.e.’s were relatively large (heterogenous sample)
 When we tried bootstraping, the reduction in the size of s.e.’s was
roughly 50% while estimators were essentially unaffected

What problems with bootstrap?
 Need to run it separately for each variable (it bootstraps only one
standard error at a time)
 Output is given in a totally different form
 It takes a looong time

28
New piece of code for just BS standard errors =>
new variable loops within each time loop
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Problems with bootstrap
foreach out in $out $outf1 $outf2 {
use data, clear
sum caliper_`d‘ /this is where the initial pscore comes useful/
local c_real=`r(mean)‘
bootstrap r(att): psmatch2 d`d' our_pscore_`d', out(`out')
common add mahalanobis(nace) caliper(`c_real')
matrix mat = e(b), e(se) /without this, no resultssets/
mat li mat
svmat mat
rename mat1 a`d'_diff_after_bs_`out‘
rename mat2 a`d'_se_after_bs_`out‘
gen time_of_event=`d'
keep se* diff* ttest* mean* time_of_event a*
drop if _n>1
save priv_bs_`out'`d', replace
}
29
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Final steps
1.
2.
3.
4.
5.
6.
30
Merge files obtained from bootstrap on „event” (to
have a complete resultsset within each „event” period)
Merge bootstrap resultssets with
Append the files for „event” periods
Organise the data
Produce tables and graphs (again in loops)
Write paper
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
The resulting graphs (1)

There are 6x3 figures alltogether
31
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
The resulting graphs (2)

There are 6x2 figures alltogether
32
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
The resulting graphs (3)

There are 6x3 figures alltogether
33
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Some advices we did not take at the right time 
1.
Save your computers’ time (your wasted time is your problem )

Use „sample 10” for testing your procedures - saves a lot of time
2.
Leaving mess is not useful if you ever want to come back

Your memory lasts shorter than that of saved files – describing
dofiles really helps

Loops are better than copy&paste – and less messy too
3.
STATA is not that complicated – modifying ado-files is really easy if you
know what you want
34
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Thank you for your attention!
Jan Hagemejer & Joanna Tyrowicz
SUGM Poland, July 2nd, 2010
Download