Chapter 1-10. Programming Stata

In this chapter, we will see how to write programs in Stata. These programs are typically saved as “ado” files. An “ado” file is simply a file of Stata commands, saved with the “ado” file extension, that has “end” on the last line of the file. Since many of the commands in Stata are themselves implemented as ado-files, a good source of example Stata code is to think of a command that does something similar to what you want to do, and then go look at the Stata code for that command.

Viewing (but not editing) an ado-file

This is done with the viewsource command. For example, to see how the ttest command was written, open the ado-file in a read-only viewer using

    viewsource ttest.ado

After the file is open, you can highlight it and copy-and-paste it into the do-file editor, so you have the sample code available to you when writing your own programs.

Viewing (but not editing) a help file

This is a very nice application of the viewsource command, because it displays how the special markup features of the help file were set up, so you can do the same thing in your own help files. For example, to see Stata’s template for help files, which was designed to get you started with developing your own help files that look like official Stata help files, use

    viewsource examplehelpfile.hlp

To see what the file looks like when it is displayed, use

    help examplehelpfile

Finding where an ado-file is

If you are curious where a particular ado-file is stored on your computer, you can find out using the findfile command.

    findfile ttest.ado
    C:\Program Files\Stata9\ado\base/t/ttest.ado

_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2010. Chapter 1-10 (revision 16 May 2010) p. 1

Finding the directories where Stata looks for ado-files

To see the order in which Stata searches directories when a command is executed, use

    adopath
      [1]  (UPDATES)   "C:\Program Files\Stata9\ado\updates/"
      [2]  (BASE)      "C:\Program Files\Stata9\ado\base/"
      [3]  (SITE)      "C:\Program Files\Stata9\ado\site/"
      [4]              "."
      [5]  (PERSONAL)  "c:\ado\personal/"
      [6]  (PLUS)      "c:\ado\plus/"
      [7]  (OLDPLACE)  "c:\ado/"

The “.” directory is the current directory, shown in the lower left-hand corner of the Stata window. Usually you would store your own commands in the PERSONAL directory, which is not supposed to be overwritten when you install a new version of Stata.

Smart Quotes

By the way, the smart quotes, “ ”, that Microsoft Word uses cannot be interpreted by Stata. So, if you copy-and-paste the following into the Command window,

    display “stuff”

you get the following error message:

    “stuff” invalid name
    r(198);

A way to get around this is to copy it into the do-file editor. The do-file editor changes the smart quotes into regular quotes, which look like

    display "stuff"

That command can then be executed inside the do-file editor, or copied and pasted into the Command window to be executed.

Executing do-files from the command line

First, decide on the directory where you want to save the do-file, and change to that directory. If you put it on your desktop, the directory might be something like the following, with “Greg” replaced by your username.

    cd "C:\Documents and Settings\Greg\Desktop\StataCourse\practice"

Now, put the commands you want to run as a batch in a do-file. For example, click on the do-file menu bar icon, which brings up a new do-file. Type the following,

    display "Hey, it worked!"

Then, do a “save as” to the file name program1.do, saving it to the current directory that you changed to with “cd” above. Now, in the Command window, execute the command

    do program1

which executes all of the commands in the do-file, and then returns control to the Command window.
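As a small sketch of this step, the lines below assume that program1.do has already been saved in the current directory as described. The pwd and dir lines are only added checks, not part of the original instructions, and run is shown as the silent counterpart of do.

```stata
* Assumes program1.do was saved in the current directory as described above.

* Show the current directory.
pwd

* Confirm that the do-file is there.
dir program1.do

* Execute the do-file, echoing each command and its output.
do program1

* Execute it again silently (the commands are not echoed).
run program1
```

If dir reports that the file is not found, the do-file was most likely saved somewhere other than the directory shown by pwd.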
This do-file, program1.do, is a simple program. It does not mimic a “command”, however, because it requires that you put “do” in front of the do-file name in order to execute it.

Converting your do-file into an ado-file

To turn a do-file into an ado-file, you simply add a “program define” line at the top and an “end” on the last line. Open the file program1.do inside the do-file editor, and change it to:

    program define amazing
        display "Hey, it worked!"
    end

The indentation on the second line (or on all lines between program and end) is not necessary, but it helps to remind you that you are inside the program-end combination. Save it as amazing.ado, instead of program1.do. On the command line, enter

    amazing
    Hey, it worked!

You have just extended Stata to include a new command called amazing.

Adding some color

We can get the display command to output in different colors, similar to what Stata does.

    text   = green
    result = yellow
    error  = red
    input  = white

Open the file amazing.ado inside the do-file editor, and change it to:

    program define amazing
        display as text "Hey, it worked!"
        display as result "Hey, it worked!"
        display as error "Hey, it worked!"
        display as input "Hey, it worked!"
    end

On the command line, enter

    amazing
    Hey, it worked!

Even though we made a change, the older version is still executing. This is because Stata loads programs into memory, and continues to execute the original version stored in memory, even though the file amazing.ado has changed on the hard drive. It is necessary to drop the program from memory, using the program drop command, before Stata will pick up the changed file.

Dropping a program from memory

What I like to do is add the program drop command as the first line of my ado-file, just to avoid this step every time I make a change.
Once the program is fully developed, you can remove that line, so that your file does not drop a user’s own program of the same name that is already in memory. Open the file amazing.ado inside the do-file editor, and add the program drop command on the first line. Precede it with “capture” so that it runs without error even if the program is not already loaded in memory.

    capture program drop amazing
    program define amazing
        display as text "Hey, it worked!"
        display as result "Hey, it worked!"
        display as error "Hey, it worked!"
        display as input "Hey, it worked!"
    end

On the command line, enter

    program drop amazing

We have to first drop it from memory, if it is there, using the Command window, since if we just run the command amazing, the old version loaded in memory continues to run. Now if we run the program again, Stata finds it on the hard drive and runs our updated version.

    amazing
    Hey, it worked!
    Hey, it worked!
    Hey, it worked!
    Hey, it worked!

Running a program inside a do-file

Sometimes it is nicer to just define the program inside a do-file, and then execute it inside the do-file. One advantage is that the program code is displayed right there where we run it, which is nice documentation. Another advantage is that debugging is faster, because we don’t have to keep going back and forth between the do-file and the command line.

Let’s try it. With the file amazing.ado in the do-file editor, save it as program2.do (not .ado). Next add a few blank lines and then put amazing as a command to call the program. (These commands are in chapter10.do.)

    capture program drop amazing
    program define amazing
        display as text "Hey, it worked!"
        display as result "Hey, it worked!"
        display as error "Hey, it worked!"
        display as input "Hey, it worked!"
    end

    amazing

Highlight the entire do-file and hit the “do current file” icon (third icon from the right) inside the do-file editor to execute it. It executes as expected.
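To confirm that running the do-file really did load the program, you can ask Stata what it is holding in memory. The following is a minimal sketch using the program dir and program list commands; the shortened one-line version of amazing is only there to keep the example self-contained.

```stata
* Define a small program in memory (same pattern as in the text).
capture program drop amazing
program define amazing
    display as text "Hey, it worked!"
end

* List the names of all programs currently loaded in memory.
program dir

* Show the stored code of one particular program.
program list amazing
```

If amazing appears in the program dir listing, then typing amazing runs the version in memory, whatever is currently saved on the hard drive.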
Doing it this way, the program is loaded into Stata memory and is available for the entire Stata session, unless you drop it. The whole step of making it an ado-file is avoided. Sometimes this is nice, and sometimes it is easier to use an ado-file so that it is available instantly for all your projects.

Writing a program to optimize test characteristics

We are now going to work through a rather complicated example for a very practical problem. It is a somewhat common research problem, or quality improvement problem, to determine the optimal cut-point for a continuous (interval scaled) diagnostic test variable to provide the best test characteristics (see box), such as sensitivity and specificity. For example, Carpenter et al (1995) did this to discover that 60% or greater carotid artery stenosis by duplex Doppler ultrasonography provided the best test characteristics when compared to the gold standard arteriography.

Test Characteristics

With the data in the required form for Stata:

                                        Test “probable value”
    Gold Standard “true value”    disease present (+)    disease absent (-)
    disease present (+)           a (true positives)     b (false negatives)    a+b
    disease absent (-)            c (false positives)    d (true negatives)     c+d
                                  a+c                    b+d

We define the following terminology (Lilienfeld, 1994, p. 118-124), expressed as percents:

    sensitivity = (true positives)/(true positives plus false negatives)
                = (true positives)/(all those with the disease)
                = a / (a + b) × 100

    specificity = (true negatives)/(true negatives plus false positives)
                = (true negatives)/(all those without the disease)
                = d / (c + d) × 100

Sensitivity and specificity provide information about the accuracy (validity) of a test. Positive and negative predictive values provide information about the meaning of the test results.

The probability of disease being present given a positive test result is the positive predictive value (Lilienfeld, 1994, p. 118-124):

    positive predictive value = (true positives)/(true positives plus false positives)
                              = (true positives)/(all those with a positive test result)
                              = a / (a + c) × 100

The probability of no disease being present given a negative test result is the negative predictive value (Lilienfeld, 1994, p. 118-124):

    negative predictive value = (true negatives)/(true negatives plus false negatives)
                              = (true negatives)/(all those with a negative test result)
                              = d / (b + d) × 100

“Unlike sensitivity and specificity, the positive and negative predictive values of a test depend on the prevalence rate of disease in the population. …For a test of given sensitivity and specificity, the higher the prevalence of the disease, the greater the positive predictive value and the lower the negative predictive value.” (Lilienfeld, 1994, p. 122-123)

The overall accuracy, or simply accuracy, is the proportion of correct test decisions, and is defined as (without citation for now, Stoddard just knows this)

    overall accuracy = (true positives plus true negatives)/(all tests)
                     = (a + d)/(a + b + c + d)

The area under the receiver operating characteristic curve, or simply ROC, for a dichotomous test and gold standard variable, or 2 × 2 table, is just the average of the sensitivity and specificity (without citation for now, Stoddard just knows this)

    ROC = (sensitivity + specificity)/2

We will practice with the AngioData.dta file (see box).

AngioData.dta dataset

This file contains n=172 de-identified pairs of measurements provided by an anonymous researcher, with two continuously scored measurements of carotid artery stenosis.
    angio     Gold Standard: arteriography (arteriographic stenosis)
    icapsv    Diagnostic Test: internal carotid artery peak systolic velocity (PSVICA)

Opening the data file, which is already in the StataCourse\practice subdirectory,

    use angiodata, clear

First, we will dichotomize the angio variable into 60% or greater carotid artery stenosis.

    recode angio 0/59=0 60/100=1 .=., gen(gold)

For a first guess at a cutpoint for icapsv, we will use the mean.

    sum icapsv

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
          icapsv |       172    172.5407    141.6103          0        575

Defining a dichotomized icapsv variable,

    gen test = cond(icapsv>=172,1,0)
    replace test=. if icapsv==.

and adding variable labels,

    label variable gold "angio"
    label variable test "icapsv"

To calculate the diagnostic test characteristics, we must first update Stata to include the diagt command. While connected to the internet,

    findit diagt

    ---------------------------------------------------------------------------
    search for diagt                                     (manual: [R] search)
    ---------------------------------------------------------------------------
    Keywords:  diagt
    Search:    (1) Official help files, FAQs, Examples, SJs, and STBs
               (2) Web resources from Stata and from other users

    Search of official help files, FAQs, Examples, SJs, and STBs

    SJ-4-4  sbe36_2 . . . . . . . . . . . . . . . Software update for diagt
            (help diagt if installed) . . . . . . P. T. Seed and A. Tobias
            Q4/04   SJ 4(4):490
            new options added to diagt

    STB-59  sbe36.1 . . . . . . . Summary statistics for diagnostic tests
            (help diagt if installed) . . . . . . P. T. Seed and A. Tobias
            1/01    pp.9--12; STB Reprints Vol 10, pp.90--93
            complete revision of diagtest to assess a simple diagnostic
            test in comparison with a reference standard; uses the exact
            binomial distribution and provides diagti, an immediate
            version of the command

Click on the sbe36_2 link to install the ado-file. If you are not connected to the internet, that is okay. The four files it adds (diagt.ado, diagt.hlp, diagti.ado, diagti.hlp) are in the StataCourse\practice subdirectory, which is now your current directory, so Stata will find these commands.

Computing the test characteristics,

    diagt gold test

               |        icapsv
         angio |      Pos.       Neg. |     Total
    -----------+----------------------+----------
      Abnormal |        44         12 |        56
        Normal |        16        100 |       116
    -----------+----------------------+----------
         Total |        60        112 |       172

    True abnormal diagnosis defined as gold = 1

                                                  [95% Confidence Interval]
    ---------------------------------------------------------------------------
    Prevalence                         Pr(A)       33%       26%     40.1%
    ---------------------------------------------------------------------------
    Sensitivity                      Pr(+|A)     78.6%     65.6%     88.4%
    Specificity                      Pr(-|N)     86.2%     78.6%     91.9%
    ROC area               (Sens. + Spec.)/2      .824      .761      .887
    ---------------------------------------------------------------------------
    Likelihood ratio (+)     Pr(+|A)/Pr(+|N)       5.7      3.54      9.16
    Likelihood ratio (-)     Pr(-|A)/Pr(-|N)      .249       .15      .413
    Odds ratio                   LR(+)/LR(-)      22.9      10.1      52.1
    Positive predictive value        Pr(A|+)     73.3%     60.3%     83.9%
    Negative predictive value        Pr(N|-)     89.3%       82%     94.3%
    ---------------------------------------------------------------------------

This looks like a pretty good guess for a cutpoint for icapsv. To do this for every possible cutpoint for icapsv, we could simply put the commands inside a loop. Let’s begin to build a program inside the do-file, and run it for the first three values of icapsv.
    capture program drop optcut
    program define optcut
        foreach num of numlist 0 21 36 {
            capture drop test
            gen test = cond(icapsv>`num',1,0)
            replace test=. if icapsv==.
            diagt gold test
        }
    end

    optcut

That worked, but let’s turn off the “more” pauses so the output runs without stopping.

    capture program drop optcut
    program define optcut
        set more off
        foreach num of numlist 0 21 36 {
            capture drop test
            gen test = cond(icapsv>`num',1,0)
            replace test=. if icapsv==.
            diagt gold test
        }
        set more on
    end

    optcut

Let’s make it so it will always work, no matter what the variable names are, by passing the gold and test variables as parameters.

    capture program drop optcut
    program define optcut
        args gold test
        local _test
        set more off
        foreach num of numlist 0 21 36 {
            capture drop _test
            gen _test = cond(`test' >`num',1,0)
            replace _test =. if `test' ==.
            diagt `gold' _test
        }
        set more on
    end

    optcut gold icapsv

Notice we named the variable created by our program to begin with “_”, similar to what Stata does, to inform the user that it was created by the program.

The “args” command informs Stata what arguments are being passed, storing each one in a local macro. If more than these two arguments are passed, the additional arguments are ignored. If fewer than two are passed, the macros for the arguments not passed are left empty. In either case, the args command itself does not issue an error message. To make sure the user provides two variables, no fewer and no more, we can use the following:

    capture program drop optcut
    program define optcut
        syntax varlist(min=2 max=2)
        tokenize `varlist'
        local gold `1'
        local test `2'
        local _test
        set more off
        foreach num of numlist 0 21 36 {
            capture drop _test
            gen _test = cond(`test' >`num',1,0)
            replace _test =. if `test' ==.
            diagt `gold' _test
        }
        set more on
    end

    optcut gold icapsv

Next, let’s make use of the Stata command levelsof to pass all of the values of our test variable to the foreach command.
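Before putting levelsof inside the program, it may help to see what it returns on its own. The sketch below uses the auto dataset that ships with Stata and its variable rep78; both are assumptions made only to keep the example self-contained, and they are not part of the course data. levelsof stores the sorted distinct nonmissing values in a local macro, and foreach then steps through that macro. Because it relies on a local macro, run the whole block at once from the do-file editor.

```stata
* Illustrative only: auto.dta ships with Stata; it is not the course dataset.
sysuse auto, clear

* Store the distinct nonmissing values of rep78 in the local macro levels.
levelsof rep78, local(levels)

* Loop over the stored values, counting the observations above each one.
* The condition rep78 < . keeps missing values out of the count.
foreach num of local levels {
    quietly count if rep78 > `num' & rep78 < .
    display "cutpoint `num': " r(N) " observations above it"
}
```

For the largest value the count is necessarily zero, since no observation can exceed the maximum. The program uses the same two steps, with the test variable in place of rep78.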
    capture program drop optcut
    program define optcut
        syntax varlist(min=2 max=2)
        tokenize `varlist'
        local gold `1'
        local test `2'
        local _test
        levelsof `test', local(levels)
        set more off
        foreach num of local levels {
            capture drop _test
            gen _test = cond(`test' >`num',1,0)
            replace _test =. if `test' ==.
            diagt `gold' _test
        }
        set more on
    end

    optcut gold icapsv

This crashes on the last value, but does great until then (presumably because no observation lies above the largest value, so _test has no positives left for diagt to tabulate).

.... there is still much to do. A much more complete, although much more complex, version is in chapter10.do.

Program to compute the statistic, Accuracy, which diagt does not provide

Here is a program you can use to compute accuracy, (a+d)/N. The last line is how to call it.

    * program to compute test characteristic accuracy
    capture program drop accuracy
    program define accuracy, byable(recall)
        version 9
        syntax varlist(min=2 max=2) [if] [in]
        tokenize `varlist'
        local goldvar `1'
        local testvar `2'
        quietly count
        quietly scalar N=r(N)
        quietly count if `goldvar'==0 & `testvar'==0
        quietly scalar d=r(N)
        quietly count if `goldvar'==1 & `testvar'==1
        quietly scalar a=r(N)
        display as result "Accuracy = (" %-2.0f a "+" %-2.0f d ")/" ///
            %-2.0f N " = " %-3.1f (a+d)/N*100 "%"
    end

    accuracy goldvar testvar

Example of how to extend your program to enable the use of the “if” qualifier

We now extend the accuracy program to enable the use of the “if” qualifier.
    * program to compute test characteristic accuracy
    capture program drop accuracy
    program define accuracy, byable(recall)
        version 9
        syntax varlist(min=2 max=2) [if] [in]
        tokenize `varlist'
        local goldvar `1'
        local testvar `2'
        * mark the observations selected by the if and in qualifiers
        tempvar touse
        mark `touse' `if' `in'
        * temporarily keep only those observations
        preserve
        keep if `touse'
        quietly count
        quietly scalar N=r(N)
        quietly count if `goldvar'==0 & `testvar'==0
        quietly scalar d=r(N)
        quietly count if `goldvar'==1 & `testvar'==1
        quietly scalar a=r(N)
        display as result "Accuracy = (" %-2.0f a "+" %-2.0f d ")/" ///
            %-2.0f N " = " %-3.1f (a+d)/N*100 "%"
        * bring back the full dataset
        restore
    end

    accuracy goldvar testvar if patientgroup==1

References

Carpenter JP, Lexa FJ, Davis JT (1995). Determination of sixty percent or greater carotid artery stenosis by duplex Doppler ultrasonography. J Vasc Surg 22(6):697-705.

Lilienfeld DE, Stolley PD (1994). Foundations of Epidemiology, 3rd ed. New York, Oxford University Press.