uk12_hagemejer_tyrowicz

advertisement
Experiences with multiple propensity
score matching
Jan Hagemejer & Joanna Tyrowicz
University of Warsaw & National Bank of Poland
Plan
1.
Standard solutions to the automatisation challenge
2.
Where they do not work? Example of propensity score matching

Using loops and global function together

Generating the resultssets for atypical estimations.

Difficulties with using bootstrap (and obtaining resultssets)
3.
Summary comments … and some (hard learned) advices
2
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
The standard route

Problem: several estimations of similar form + need to compare results.

Three simple solutions:
 Solution 1: brute force = sit & type (copy / paste from output)
 Solution 2: use parmest (Roger Newson) if estimations on simple
categories in data (limitations of „by” command)
 Solution 3: use loops
 outreg/outreg2
 nicely formatted tables,
 publication-ready,
 in many formats, even directly to Word or LaTeX.
 Note: if you need nice summary statistics, you can use outsum
either with by or within loops
3
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Where the problems come from?


2nd and 3rd solution works only with regression-type estimations
However, some procedures are incompatible with pre-cooked solutions

Need to report:
 output of the procedure
 sample properties after matching
 balancing properties of matching

Problem1: actually, none of these is in the typical output

Problem2: we need it for many estimations looped over many variables
and each one of them takes a looooong time
4
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Detailed problem description

Analyse the effects of privatisation


Observe what happens before and after the „event” of privatisation


E.g. profits, investments, international competitiveness, employment, productivity
Effects may be due to self-selection


E.g. firm A may be one year before privatisation in 1999 and firm B in 2006, so „event” is an
anchor and time „runs” both ways.
Effects may be observed in many spheres:


Take two firms A and A’. Firm A gets privatized. Firm A’ does not get privatised (ever). Want to
compare firms A and A’ each year before and after privatisation of firm A (in fact we are
comparing private firms to privatized SOEs due to few SOEs left in the sample)
E.g. only better firms are privatised, so difference in performance is not due to privatisation
(there might be other effects why firms are privatised related to, for instance, budget presure).
Use propensity score matching to compare privatised firms to nonprivatised firms
5
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
What we want to get:
6
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Detailed problem description
Thus, in our case:








7
Many time periods (for each „time-to-anchor” a separate estimation)
Many variables (for each variable separate outcomes, but within one
„anchor” the same balancing properties)
Two ways of estimating: regular and bootstrapping (especially the latter
made things complex)
Each estimation: roughly 1.5-3.5 hours (big dataset)
Over a hundred estimations
To verify if matching is ok, need to check balancing properties
Additional pitfalls:

We needed some statistics for all estimations and they were not in the
return list

More precisely: procedure computes them to be able to produce output,
but they were not added to the return list by authors
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Summary of the problems
Our problem was quite specific… BUT consisted of many general problems:
1.
Loops take a lot of time – need to find efficient ways
2.
Some things cannot be obtained fast => even more reasons to run it
automatically
3.
Obtaining datasets of the results we need (so-called resultssets)

Getting visible data if they are not an output

Using invisible data
4.
Getting around with bootstrap
8
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
The structure of our estimations
Specific loops
• Balancing properties
• Before and after matching
statistics
Loop for variables (15 variables)
• Run standard estimations
• Run bootstrap estimation
Loop for time (12 periods)
9
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
How global function can be usefull?
Using the global function for estimations

Our application: observe the same firms back and forth from the moment of
privatisation („anchor”)
 „Anchors” happen in different years
 But we can only match on one dimension: has or has not the „anchor”

Conceptual solution: use lags and forwards to get the time dimension

Technical problem: many outcome variables and de facto many loops

Technical solution: define separately matching variables and output variables
global in=„capital roa export_status etc…” MATCHING VARS!
global out=„productivity employment efficiency etc…” COMPARISON VARS!
global outf=„forwards of $out”
11
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Getting from results to „resultssets”
Why (and what) do we need (in) the resultssets?

Why?
 Most importantly: without resultssets we cannot
 analyse the changes over time
 decompose the observed differentials
 If we do not do it automatically, it would have to be copied manually
from logs – many estimations, many variables, etc

What ? Step 1: Find out the reality
1. Size of each of the three groups: treated, total and control (= matched)
2. Averages in all three groups (medians, etc.)
3. Knowledge if in fact they are different (= test of the statistical
significance based on difference and standard error of this difference)

What? Step 2: find out, how good the findings are statistically
1. Balancing properties!
13
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Our solution to step 1

Initialize the store for our resultsets using postfile. Index the result table
with variable names, years and other things that the code loops around
tempname memhold
postfile memhold indices
variable_names_for_results

Start the big loop (event)
forvalues d=6(1)18

{
Run pscore (needed for bootstrap) and subsequently psmatch
psmatch2 d`d' our_pscore_`d', out($out $outf $outl)
some options
14
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Our solution to step 1

Run pscore and psmatch
psmatch2 d`d' our_pscore_`d', out($out $outf
$outl) some options

Start the loop
foreach out in $out $outf1 $outf2 {

Generate means and standard errors for treaded/matched/unmatched,
using output from psmatch (some more about this later)
local se_after=r(seatt_`out')

Post the `locals’ to the postfile using post command in each loop iteration
15
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Our solution to step 2

For balancing properties we need to use pstest over all the matching variables
pstest $in

In order to produce nice tables, we need to loop over all the matching variables in $in and
create some ‚locals’ in memory to later save them as separate variables:
foreach in in $in {
capture local bias_reduction=r(bired_`in')
capture local pvalue_bef=r(pbef_`in')
capture local pvalue_after=r(paft_`in')
capture gen b_red_`in'=`bias_reduction'
capture gen pval_ber_`in'=`pvalue_bef'
capture gen pval_aft_`in'=`pvalue_after‚

}
Spit out everything to a spreadsheet (alternatively you can use postfile again):
outsheet b_red* pval* using stats_priv_`d', replace

Make some graphs and clean up
psgraph
graph save priv_support_`d', replace
drop b_red* pval*
16
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
„Missing statistics”
Solving problem of „missing” statistics


Psmatch produces nice tables with all the required statistics. However,
they are only shown on the screen and vanish right after that
Look into the „ado” file you are using (procedure)
 Throughout the file, there are commands
return scalar x=`somelocal’
 Sometimes – for clarity – scalars are dropped at the end of
procedure
 Your prefered statistic (if it is in the output, it has to be at least a
local) would simply have to have a local like that too
 If it does not – you can always generate it based on your
preferences and available locals
=> Modify the original ado file
18
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Solving problem of „missing” statistics – example 1
Original ado file – line 380
qui foreach v of varlist `varlist' {
replace _`v' = . if _support==0
tempname m1t m0t u0u u1u att dif0
sum `v' if _treated==1, mean
scalar `u1u' = r(mean)
sum `v' if _treated==0, mean
scalar `u0u' = r(mean)
sum `v' if _treated==1 & _support==1, mean
scalar `m1t' = r(mean)
local n1 = r(N)
sum _`v' if _treated==1 & _support==1, mean
scalar `m0t' = r(mean)
scalar `att' = `m1t' - `m0t'
scalar `dif0' = `u1u' - `u0u‘
return scalar att = `att'
return scalar att_`v' = `att‚
Modified ado file – line 380
qui foreach v of varlist `varlist' {
replace _`v' = . if _support==0
tempname m1t m0t u0u u1u att dif0
…
/all the same as earlier plus /
return scalar diff = `dif0'
return scalar diff_`v' = `dif0‘
return scalar mean0 = `u0u'
return scalar mean0_`v' = `u0u‘
return scalar mean1 = `u1u'
return scalar mean1_`v' = `u1u'
/no „return” of needed scalars/
19
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Solving problem of „missing” statistics – example 2
Original ado file – line 440
Modified ado file – line 440
return scalar seatt = `stderr'
return scalar seatt = `stderr'
return scalar seatt_`v' = `stderr'
return scalar seatt_`v' = `stderr'
qui regress `v' _treated
qui regress `v' _treated
scalar `ols' = _b[_treated]
scalar `ols' = _b[_treated]
scalar `seols' = _se[_treated]
scalar `seols' = _se[_treated]
return scalar seols = `seols‘
return scalar seols_`v' = `seols'
20
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Problems with bootstrap
Problems with bootstrap


The psmatch procedure does not take into account when calculating se’s
that the propensity score is estimated. A possible solution to this is to use
bootstrap.
What problems with bootstrap?
 Need to run it separately for each variable (it bootstraps only one
standard error at a time)
 Output is given in a totally different form
 It takes a looong time

22
New piece of code for just BS standard errors =>
new variable loops within each time loop
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Problems with bootstrap


Again, create the postfile
Run the actual bootstrap in loops (post results in every iteration)
foreach out in $out $outf1 $outf2 {
use data, clear
bootstrap r(att): psmatch2 d`d‘ $in, out(`out') some
options
matrix mat = e(b), e(se) /without this, no resultssets/
svmat mat /convert matrix to variables/
rename mat1 a`d'_diff_after_bs_`out‘/create meaningful names/
rename mat2 a`d'_se_after_bs_`out‘
gen time_of_event=`d
post `postfile’ indices (a`d'_diff_after_bs_`out‘)
(a`d'_se_after_bs_`out‘)
}
postfile close
23
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Final steps
1.
2.
3.
4.
24
Merge files obtained from bootstrap on „anchor” (to
have a complete resultsset within each „anchor”
period)
Organise the data
Produce tables and graphs (again in loops)
Write paper
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
The resulting graphs (1)

6 figures showing levels for 3 groups (15 matches each)
25
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
The resulting graphs (2)

6 figures showing the decomposition of the treatedunmatched difference (15 matches each)
26
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
The resulting graphs (3)

6xn figures showing the „balanced panel” version for all
variables of the treated-unmatched difference
27
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Some advices we did not take at the right time 
1.
Use „sample 10” for testing procedures - saves a lot of time
2.
Leaving mess is not useful if we ever want to come back

Your memory lasts shorter than that of saved files – describing
dofiles really helps

Loops are better than copy&paste – and less messy too
3.
Beware of changes in STATA syntax (all the time…)
28
Jan Hagemejer & Joanna Tyrowicz
SUGM London, 2012
Thank you for your attention!
Jan Hagemejer & Joanna Tyrowicz
University of Warsaw and National Bank of Poland
Download