Optimizing Stata for Analysis of
Large Data Sets
Joseph Canner, MHS
Eric Schneider, PhD
Johns Hopkins University
Stata Conference
New Orleans, LA
July 19, 2013
Background
• Programmer/Statistician: 20 years
experience with SAS
• Took new job and started using Stata in
January 2013
• Reviewed many do-files from
predecessors and colleagues in order to
learn Stata and understand new job
Caveats
• Large data sets: irrelevant if you don’t
use large data sets and/or if you don’t
have a system that has sufficient
memory to analyze large data sets
• Coding practices: these are examples
from real users, but not necessarily
trained programmers or Stata experts
Benchmark Testing
•
•
•
•
•
•
NIS 2010 Core (unless noted otherwise)
7,800,441 observations
155 variables
5.6 Gb memory
25 ICD-9 diagnosis codes (DX1-DX25)
15 ICD-9 procedure codes (PR1-PR15)
Benchmark Testing
• Testing code:
timer clear 1
timer on 1
…Code to be tested…
timer off 1
timer list 1
• Groups of tests always run at the same time to
eliminate issues with different
server/memory/usage conditions
– 24 core CPU, 256 Gb RAM (50% load), Windows
2008
Test #1: Coding ICD-9 variables
• Option 1:
gen FOREACH=0
forvalues x = 1/15 {
foreach value in "7359" "741"
"9955" "640" {
replace FOREACH=1 if
PR`x'=="`value'"
}
}
• Time=27.6 sec
Test #1: Coding ICD-9 variables
• Option 2:
gen IFOR=0
forvalues x = 1/15 {
replace IFOR=1 if PR`x'=="7359" |
PR`x'=="741" | PR`x'=="9955" |
PR`x'=="640"
}
• Time=13.2 (half the time!)
Test #1: Coding ICD-9 variables
• Option 3:
gen INLIST=0
forvalues x = 1/15 {
replace INLIST=1 if
inlist(PR`x',"7359","741",
"9955","640")
}
• Time=9.6 sec (a little better than Option 2,
and easier to write and read)
Test #1a: Coding single ICD-9 variables
inlist() vs. recode
• Option 1:
gen INLIST1=0
replace INLIST1=1 if
inlist(PR1,"7359","741","9955","640",
"9904","8154","7569","3893")
• Time=1.2 sec
Test #1a: Coding single ICD-9 variables
inlist() vs. recode
• Option 2a:
destring PR1, gen(tempPR1)
ignore("incvl")
recode tempPR1 (7359 741 9955 640
9904 8154 7569 3893 = 1)
(else=0), gen(RECODE)
drop tempPR1
• Time=118.1 sec (Ouch! Much of the time
is devoted to the destring command)
Test #1a: Coding single ICD-9 variables
inlist() vs. recode
• Option 2b (use real() instead of destring):
gen tempPR1=real(PR1)
recode tempPR1 (7359 741 9955 640
9904 8154 7569 3893 = 1)
(else=0), gen(RECODE)
drop tempPR1
• Time=26.0 sec (much better than
destring, but still much slower than inlist())
Test #1b: Coding single ICD-9 variables
when there are ranges
• Option 1:
split ECODE1, gen(nECODE) parse(E)
destring nECODE2, gen(iECODE1)
drop nECODE2
recode iECODE1
(9200/9209 956 966 986 974 = 1)…
(8800/8869 888 9570/9579 9681 9870 =2)
(9220/9223 9228 9229 9550/9554 9650/9654
9794 9850/9854 970=3)
(8100/8199 9585 9685 9885=4), gen(mech1)
recode mech1 (5/10000=5)
• Time= 142.6 sec (Again, split and destring
take the bulk of the time here.)
Test #1b: Coding single ICD-9 variables
when there are ranges
• Option 2:
iECODE1=real(substr(ECODE1,2,4))
recode iECODE1
(9200/9209 956 966 986 974 =1)…
() () ()…, gen(mech2)
recode mech2 (5/10000=5)
• Time= 68.7 sec; better, but…
Test #1b: Coding single ICD-9 variables
when there are ranges
• Option 3:
gen mech3=.
replace mech3=1 if (ECODE1>="E9200" &
ECODE1<="E9209") |
inlist(ECODE1,"E956","E966",
"E986","E974")
…
replace mech3=5 if mech3==. &
substr(ECODE1,1,1)=="E"
• Time=5.74 sec (a little harder to write, but
much faster!)
Test #1b: Coding single ICD-9 variables
when there are ranges
• Option 4:
gen mech4=.
replace mech4=1 if
inrange(ECODE1,"E9200”,"E9209") |
inlist(ECODE1,"E956","E966",
"E986","E974")
…
replace mech4=5 if mech3==. &
substr(ECODE1,1,1)=="E"
• Time=5.32 sec (a little faster still, and
much easier to write)
Test #1: Coding ICD-9 Variables
Conclusions
• Using inlist() reduces the time required to
recode ICD-9 variables by 65% when searching
15 variables for 4 target codes.
• Performance improves to 80% for 8 codes, and
continues to improve slightly thereafter, with a
maximum improvement of 92%. (Note: inlist()
limit is 10 string codes or 255 numeric codes)
• In order to “stress” the test, the codes used in
the test are the most popular, but the results
are the same for any set of codes.
Test #1: Coding ICD-9 Variables
Conclusions (cont’d)
• Using recode is much slower than inlist()
for lists of single ICD-9 codes, in large part
because of the need to convert from string
to numeric
• Using recode for ranges is also much
slower than replace/if, for the same
reason; inrange() also helps with
readability
• Can use real() instead of destring,
substring() instead of split
Test #2: Recoding continuous
variables
• Option 1:
gen AGE1=.
replace AGE1=1 if AGE>=0 & AGE <=9
replace AGE1=2 if AGE>=10 & AGE <=19
…
replace AGE1=10 if AGE>=90 & AGE <=120
• Time=6.6 sec
Test #2: Recoding continuous
variables
• Option 2:
gen
AGE2=recode(AGE,9,19,29,39,49,
59,69,79,89,120)
• Time=0.66 sec (exactly one-tenth of the
time(!) and easier to write and read)
• Caution: need to be careful with truly
continuous variables that you are
cutting at the right place
Test #2: Recoding continuous
variables
• Option 3:
recode AGE (0/9=1) (10/19=2) (20/29=3)
(30/39=4) (40/49=5) (50/59=6) (60/69=7)
(70/79=8) (80/89=9) (90/120=10),
gen(AGE3)
• Time=46.3 sec (Ouch!) and harder to
write
• May be useful for instances where ranges
are not mutually exclusive (i.e., can’t use
recode function)
Test #3: Reordering Values
• Option 1:
gen sex_new=sex
replace sex_new=0
replace sex_new=5
replace sex_new=4
replace sex_new=1
replace sex_new=2
if
if
if
if
if
sex_new==3
sex_new==2
sex_new==1
sex_new==5
sex_new==4
• Time=2.0 sec; very cumbersome and
hard to follow
Test #3: Reordering Values
• Option 2:
recode sex (3=0) (1=2) (2=1),
gen(sex_new1)
• Time=15.0 sec (Ouch! ); but, easier to
write and MUCH easier to read)
• Can also use recode to do things like:
(3 4 = 0) // 3 and 4 are recoded to 0
(3/5 = 0) // 3, 4, and 5 are recoded to
0
Test #3: Reordering Values
• Option 3:
gen sex_new=sex
replace sex_new=0 if sex==3
replace sex_new=1 if sex==2
replace sex_new=2 if sex==1
• Time=1.4 sec (Faster than Option #1 by
40% and not too hard to read/write)
Test #4 De-stringing Numeric Values
(e.g., NSQIP age)
• Option 1 (Variation of Test #3 Option #1):
encode age, gen (age_new)
replace age_new=180 if age_new==1
…
replace age_new=900 if age_new==73
replace age_new=18 if age_new==180
…
replace age_new=90 if age_new==900
• Time=25.8 sec (NSQIP 2011; n=442,149),
• Always need to do “tab age_new, nolabel”
because labels are messed up
Test #4 Destringing Numeric Values
(e.g., NSQIP age)
• Option 2:
destring age, gen(age_new1) ignore(“+”)
• Time=6.3 sec (NSQIP 2011;
n=442,149); four times faster!
• Caution: make sure it is clear that
89=89+
Test #4a Removing Characters from
ID Numbers (e.g., XXX-XX-XXXX)
• Option 1
destring SSN, ignore("-") gen(newSSN1)
• Time=33.0 sec
Test #4a Removing Characters from
ID Numbers (e.g., XXX-XX-XXXX)
• Option 2:
gen long newSSN2=
real(subinstr(SSN,"-","",.))
• Time=1.7 sec; almost 20 times faster!
• Only useful if there are a few
characters to get rid of.
Future Tests
• Confirm results for 10 years of NIS
(about 80 million observations, nearly
50 Gb RAM)
• Other Stata commands where there are
multiple ways to do the same
thing…any ideas?
• Other programming practices found
reviewing code written by colleagues
and students
Implications
• With 10 years of NIS, could save…
–
–
–
–
3 minutes per ICD-9 recode
1 minute per continuous variable categorization
6 seconds per variable reorder
A lot more if you used recode
• It all adds up!
• Might make it less onerous to run recoding and
cleaning programs more often instead of saving
new copies of the dataset
• Easier to read programs