Chapter 1-99. Homework Problem Solutions

Chapter 1-1. Installing Stata and recovering Stata windows
No problems.
Chapter 1-2. Getting data into Stata and some other basics
Problem 1) Creating a csv file
Open Microsoft Excel. Highlight the white cells (not the first row or first column) of the
following table, copy them, and paste them into Excel.
     A          B     C
1    Controls
2
3    id         age   sbp
4    1          40    120
5    2          48    125
Save the file as a csv file, paying attention to which directory it goes into. Here are the
steps:
Inside Excel, click on top left icon (the Office Button),
Double Click Save As
File name: ch1-14-problem1.csv
Save as type: CSV (Comma delimited) (*.csv)
Save
Excel will then ask some questions:
“The selected file type does not support….” Answer OK (this is just Excel letting you
know that only the worksheet you have open will be saved as a csv file).
“ch1-14-problem1.csv may contain features….” Answer Yes (this is just Excel letting
you know that features like colored text and colored shading will not show up in
the csv file).
Click on the X in the upper right corner to exit Excel.
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah
School of Medicine. Chapter 1-99. (Accessed January 8, 2012, at http://www.ccts.utah.edu/biostats/
?pageId=5385).
Chapter 1-99 (revision 8 Jan 2012)
p. 1
Excel will then ask, “Do you want to save the changes..” Answer No, since you have
already saved the file. (Excel does not know if you made any additional changes
in between the save you just did and when you are now exiting Excel.)
The file ch1-14-problem1.csv should now be in some directory, perhaps My Documents.
Problem 2) Reading in a csv file from the Stata menu
Stata cannot open a csv file directly, but it can import it. From inside Stata, on the main
menu bar, click on
File
Import
ASCII data created by a spreadsheet
ASCII dataset filename: click on Browse
The Open box comes up. In the window Files of type,
while you are looking in the correct directory where you saved
the file in Problem 1),
Ask for “Comma Separated Values (*.csv)”
Click on “ch1-14-problem1.csv”
Open
OK
You will see that Stata brought in the data and named the variables v1, v2, and v3.
This is all Stata could do, since it has no way to know that the variable names are in row
3.
Listing the data,
list

     +----------------------+
     |       v1    v2    v3 |
     |----------------------|
  1. | Controls             |
  2. |                      |
  3. |       id   age   sbp |
  4. |        1    40   120 |
  5. |        2    48   125 |
     +----------------------+

we see the variable names on row 3 and notice that rows 1 and 2 have no value as data.
Problem 3) Having Stata delete first two rows and bring back in starting on row 3
Now do the trick found in Chapter 1-2, under the heading
Importing an Excel File Into Stata When the Variable Names are Not on the First Row
Cut-and-paste the following lines into the Stata do file editor.
* bring data back in using 3rd row as variable names
drop in 1/2
outsheet using temp1.csv, comma nonames replace
insheet using temp1.csv, clear names
erase temp1.csv
Highlight this block of Stata commands and hit the execute button (last icon on the do-file menu bar).
Listing the data again,
list

     +----------------+
     | id   age   sbp |
     |----------------|
  1. |  1    40   120 |
  2. |  2    48   125 |
     +----------------+

We now have the data read in with the desired variable names.
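For readers who want to check the logic of this trick outside Stata, here is a minimal Python sketch of the same idea: discard the rows above the header row, then treat that row as the variable names. The function name `read_with_header_row` and the `sample` text are illustrative assumptions, not part of the course materials.

```python
import csv
import io

def read_with_header_row(text, header_row):
    """Parse CSV text, dropping rows above header_row (1-based) and using that row as names."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[header_row - 1]
    # Every row after the header row becomes one record keyed by the header names
    return [dict(zip(header, row)) for row in rows[header_row:]]

# Assumed contents of the csv file from Problem 1 (names are on row 3)
sample = "Controls,,\n,,\nid,age,sbp\n1,40,120\n2,48,125\n"
records = read_with_header_row(sample, header_row=3)
for record in records:
    print(record)
```

As in the Stata version, the values come back as text; converting them to numbers is a separate step.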
Chapter 1-3. Cleaning data
Problem 1) Convert Excel file to an ASCII file (explained in Chapter 1-2)
Open the file homework1.xls inside Excel. Save it back out as an ASCII file with a csv
file extension, calling it homework1.csv. The contents of this file are:
id   sex   age
1    M     unknown
2    m     20
3    F     21
4    F     30
5    m
6    f     40
Problem 2) Reading ASCII file into Stata (explained in Chapter 1-2)
Part 1) Change your working directory to wherever the course manual data files are.
You can use the menu to help with this.
File
Change working directory…
Browse until you find the datasets & do-files directory
OK
. cd "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files"
C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files
Part 2) Read the file homework1.csv into Stata using the “insheet” command (in the
Command window or do-file editor, rather than the menu Import option). The command
you need is:
insheet using homework1.csv, clear
Here the “clear” option was added. Actually, this option is only needed if data are already
in Stata memory.
Problem 3) Convert messy string variable to a numeric variable
In the do-file editor, create a numeric “female” variable from the string “sex” variable
using the inlist function. Use 1 = female and 0 = male.
capture drop female // optional
gen female = 1 if inlist(sex, "F", "f")
replace female = 0 if inlist(sex, "M", "m")
tab sex female, missing // check that it worked

           |        female
       sex |         0          1 |     Total
-----------+----------------------+----------
         F |         0          2 |         2
         M |         1          0 |         1
         f |         0          1 |         1
         m |         2          0 |         2
-----------+----------------------+----------
     Total |         3          3 |         6

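The same mapping can be sketched in Python as a cross-check. This is only an illustration of the logic, with a hypothetical function name `female_from_sex`; any spelling other than the four listed stays missing (`None`), just as the Stata commands leave it missing.

```python
def female_from_sex(sex):
    """Return 1 for female, 0 for male, None (missing) for any other spelling."""
    if sex in ("F", "f"):
        return 1
    if sex in ("M", "m"):
        return 0
    return None

# The sex column of homework1.csv, in row order
sexes = ["M", "m", "F", "F", "m", "f"]
female = [female_from_sex(s) for s in sexes]
print(female)
```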
Problem 4) Clean up messy numeric variable that contains strings
Convert the age variable to numeric, after setting “unknown” to missing. (Something
similar was done for matage in Chapter 1-3.)
replace age="" if age=="unknown"
destring age, replace
sum age
list age

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |         4       27.75    9.322911         20         40

     +-----+
     | age |
     |-----|
  1. |   . |
  2. |  20 |
  3. |  21 |
  4. |  30 |
  5. |   . |
     |-----|
  6. |  40 |
     +-----+

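To confirm the summary statistics by hand, the following Python sketch repeats the cleaning step and recomputes the mean and standard deviation from the four nonmissing ages. The helper name `to_number` is hypothetical, and the empty string for the fifth row is an assumption based on the listing, which shows that value as missing.

```python
import statistics

def to_number(value):
    """Convert a text cell to float, treating blanks and 'unknown' as missing (None)."""
    value = value.strip()
    if value in ("", "unknown"):
        return None
    return float(value)

# Age column as text; the blank fifth entry is assumed from the listing
raw_age = ["unknown", "20", "21", "30", "", "40"]
age = [to_number(v) for v in raw_age]

observed = [a for a in age if a is not None]
mean_age = statistics.mean(observed)
sd_age = statistics.stdev(observed)  # sample standard deviation (n - 1 denominator)
print(len(observed), mean_age, round(sd_age, 6))
```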
Problem 5) Save this cleaned up dataset as a Stata formatted file.
save homework1, replace
(note: file homework1.dta not found)
file homework1.dta saved
Chapter 1-4. Merging files
Problem 1) Match Merge
The Excel spreadsheet, called chapter1-4exercise.xls has two sheets of data. The first
sheet is called demographics, and the second is called qol.
Sheet 1: demographics

id   age   female
1    15    1
2    29    1
3    18    0
4    20    0
5    22    0

Sheet 2: qol

id   quality of life
1    2
2    2
3    3
5    4

Save each sheet into an ASCII file, such as a csv file, perform a match-merge on id, and
then list the data. (If needed, look at Chapter 1-2 Problem 1, above, to see how to save a
worksheet in Excel to a csv file.)
Your solution should look like:
. list

     +------------------------------------------------+
     | id   age   female   qualit~e            _merge |
     |------------------------------------------------|
  1. |  1    15        1          2       matched (3) |
  2. |  2    29        1          2       matched (3) |
  3. |  3    18        0          3       matched (3) |
  4. |  4    20        0          .   master only (1) |
  5. |  5    22        0          4       matched (3) |
     +------------------------------------------------+

To avoid abbreviating the variable names to 8 characters, you can use the “abbrev(15)”
option on the list command to allow up to 15 characters per column:
. list , abbrev(15)

     +-----------------------------------------------------+
     | id   age   female   qualityoflife            _merge |
     |-----------------------------------------------------|
  1. |  1    15        1               2       matched (3) |
  2. |  2    29        1               2       matched (3) |
  3. |  3    18        0               3       matched (3) |
  4. |  4    20        0               .   master only (1) |
  5. |  5    22        0               4       matched (3) |
     +-----------------------------------------------------+

Finally, save the merged file under a new file name (as a Stata-formatted file).
A good way to keep track of where the csv files came from is to append the sheet name
onto the Excel file name. If you did it this way, you would have saved the demographics
sheet into the file:
chapter1-4exercise_demographics.csv
and the qol sheet into the file:
chapter1-4exercise_qol.csv
Any file names you chose would have been fine, as well.
After changing to the directory where these files are located, perhaps using the change
directory (cd) command or the “change working directory” menu option (as was done in
Chapter 1-3 problem 2 above), you could use the following,
insheet using chapter1-4exercise_qol.csv, clear
sort id
save chapter1-4exercise_qol, replace
insheet using chapter1-4exercise_demographics.csv, clear
sort id
merge 1:1 id using chapter1-4exercise_qol
list , abbrev(15)
save chapter1-4final, replace

     +-----------------------------------------------------+
     | id   age   female   qualityoflife            _merge |
     |-----------------------------------------------------|
  1. |  1    15        1               2       matched (3) |
  2. |  2    29        1               2       matched (3) |
  3. |  3    18        0               3       matched (3) |
  4. |  4    20        0               .   master only (1) |
  5. |  5    22        0               4       matched (3) |
     +-----------------------------------------------------+

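The match-merge logic can also be sketched in plain Python, which may help in seeing why id 4 ends up as “master only”. This is an illustrative sketch, not Stata's implementation: the function name `merge_one_to_one` is hypothetical, and the `_merge` text values simply imitate the labels in the listing.

```python
def merge_one_to_one(master, using, key):
    """Join two lists of dicts on `key`, tagging each row the way _merge does."""
    using_by_key = {row[key]: row for row in using}
    master_keys = [row[key] for row in master]
    if len(using_by_key) != len(using) or len(set(master_keys)) != len(master_keys):
        raise ValueError("key must uniquely identify rows in both files")
    using_columns = sorted({c for row in using for c in row if c != key})
    master_columns = sorted({c for row in master for c in row if c != key})

    merged = []
    for row in master:
        out = dict(row)
        match = using_by_key.get(row[key])
        for col in using_columns:
            # Unmatched master rows get missing (None) for the using variables
            out[col] = match[col] if match is not None else None
        out["_merge"] = "matched (3)" if match is not None else "master only (1)"
        merged.append(out)

    seen = set(master_keys)
    for row in using:
        if row[key] not in seen:
            out = {col: None for col in master_columns}
            out.update(row)
            out["_merge"] = "using only (2)"
            merged.append(out)
    return merged

demographics = [
    {"id": 1, "age": 15, "female": 1},
    {"id": 2, "age": 29, "female": 1},
    {"id": 3, "age": 18, "female": 0},
    {"id": 4, "age": 20, "female": 0},
    {"id": 5, "age": 22, "female": 0},
]
qol = [
    {"id": 1, "qualityoflife": 2},
    {"id": 2, "qualityoflife": 2},
    {"id": 3, "qualityoflife": 3},
    {"id": 5, "qualityoflife": 4},
]
merged = merge_one_to_one(demographics, qol, key="id")
for row in merged:
    print(row)
```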
Chapter 1-5. Labeling variables and values
Problem 1) Assigning Labels
Cut-and-paste the following lines into the Stata do-file editor. Highlight this block of
Stata commands and hit the execute button (last icon on the do-file menu bar) to load the
dataset into Stata.
clear
input id dose
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 3
9 3
end
list

     +-----------+
     | id   dose |
     |-----------|
  1. |  1      1 |
  2. |  2      1 |
  3. |  3      1 |
  4. |  4      2 |
  5. |  5      2 |
     |-----------|
  6. |  6      3 |
  7. |  7      3 |
  8. |  8      3 |
  9. |  9      3 |
     +-----------+

Assign the following labels to the dose variable (variable label and value labels):
dose: ibuprofen dose category
   1: 1) low dose
   2: 2) mod dose
   3: 3) high dose
and generate a frequency table. Your solution should look like:

   ibuprofen |
        dose |
    category |      Freq.     Percent        Cum.
-------------+-----------------------------------
 1) low dose |          3       33.33       33.33
 2) mod dose |          2       22.22       55.56
3) high dose |          4       44.44      100.00
-------------+-----------------------------------
       Total |          9      100.00

label variable dose "ibuprofen dose category"
label define doselab 1 "1) low dose" 2 "2) mod dose" ///
3 "3) high dose"
label values dose doselab
tab dose

   ibuprofen |
        dose |
    category |      Freq.     Percent        Cum.
-------------+-----------------------------------
 1) low dose |          3       33.33       33.33
 2) mod dose |          2       22.22       55.56
3) high dose |          4       44.44      100.00
-------------+-----------------------------------
       Total |          9      100.00

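The percentages in the frequency table are easy to verify by hand. The Python sketch below, with a hypothetical helper `frequency_table`, attaches the value labels to the codes and recomputes the count, percent, and cumulative percent columns.

```python
from collections import Counter

def frequency_table(values, labels):
    """Return (label, count, percent, cumulative percent) rows in code order."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    running = 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((
            labels.get(value, str(value)),
            counts[value],
            round(100 * counts[value] / total, 2),
            round(100 * running / total, 2),
        ))
    return rows

dose = [1, 1, 1, 2, 2, 3, 3, 3, 3]
dose_labels = {1: "1) low dose", 2: "2) mod dose", 3: "3) high dose"}
table = frequency_table(dose, dose_labels)
for row in table:
    print(row)
```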
Chapter 1-6. Basic graphics
Problem 1) Publication quality graph
Anker et al (2009), in their Figure 2, Panel C, present a mean change from baseline graph
with error bars that represent standard errors. The results are a 6-minute-walk test, with a
separate line for the ferric carboxymaltose (FCM) group and the placebo group.
Measures occurred at baseline, week 4, 12, and 24. By visual examination of the graph,
the data are approximately,
Mean changes from baseline with standard errors

Group     Baseline   Week 4   Week 12   Week 24
FCM       0          21±3     39±4      38.5±5
Placebo   0          2±5      3±6       10±6

The assignment is to create a graph in the do-file editor. First, input the mean change and
the lower and upper bounds of the error bars, something like what was done in Chapter 1-6,
p.27. Then, using other commands, or graph options found in that chapter, create the
following black-and-white graph:
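[Figure: line graph of change in 6-minute-walk distance (y-axis, -10 to 50) against
Weeks since Randomization (x-axis, 0 to 24), with one line and standard-error bars each
for FCM and Placebo; "P<0.001" is printed above each of the three follow-up time points.]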
Hint: add one feature at a time. That way, it is easier to tell which feature an error
message belongs to (it belongs to the last thing you added).
Hint: The following table shows where Stata positions titles, subtitles, etc. To add some
white space on the right side of graph, just use
r1title(" ")
to put a blank title on that side.

                      title
                     subtitle
                     t2title
                     t1title
l2title   l1title   (plot region)   r1title   r2title
                     b1title
                     b2title
                     legend
                     note
                     caption

Work on it long enough to get most of the features. Stop and look at the solution when it
gets to the point of being more frustrating than it is fun.
The graph was drawn using the following Stata commands:
clear
input week fcm change lower upper
0 1 0 . .
4 1 21 18 24
12 1 39 35 43
24 1 38.5 33.5 43.5
0 0 0 . .
4 0 2 -3 7
12 0 3 -3 9
24 0 10 4 16
end
list
*
sort week
#delimit ;
twoway
(scatter change week if fcm==1, msymbol(diamond) mlcolor(black)
mfcolor(black))
(rcap lower upper week if fcm==1, color(black))
(line change week if fcm==1, lcolor(black))
(scatter change week if fcm==0, msymbol(circle) mlcolor(black)
mfcolor(white) msize(*1.5))
(rcap lower upper week if fcm==0, color(black))
(line change week if fcm==0, lcolor(black))
, scheme(s1mono) legend(off)
plotregion(style(none))
ytitle("Change in Distance (m)")
xtitle("Weeks since Randomization" , height(5))
ylabels(-10(10)50, angle(horizontal))
xlabels(0(4)24)
text(35 17 "FCM", placement(e))
text(11 19 "Placebo", placement(e))
text(50 4 "P<0.001", placement(c))
text(50 12 "P<0.001", placement(c))
text(50 24 "P<0.001", placement(c))
r1title(" ") /* blank y-title on right side to add white space */
;
#delimit cr
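The lower and upper values typed into the input block are simply the mean change minus and plus one standard error. A short Python sketch of that arithmetic follows; the dictionary layout and the name `error_bar_bounds` are illustrative choices, and the means and standard errors are the approximate values read off the published graph.

```python
def error_bar_bounds(mean, se):
    """Return (lower, upper) for an error bar of one standard error either side."""
    return (mean - se, mean + se)

# group -> week -> (mean change, standard error), as read from the graph
summary = {
    "FCM": {4: (21, 3), 12: (39, 4), 24: (38.5, 5)},
    "Placebo": {4: (2, 5), 12: (3, 6), 24: (10, 6)},
}

bounds = {
    group: {week: error_bar_bounds(mean, se) for week, (mean, se) in weeks.items()}
    for group, weeks in summary.items()
}
for group, weeks in bounds.items():
    for week, (lower, upper) in weeks.items():
        print(group, week, lower, upper)
```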
Chapter 1-7. Looping, collapsing, and reshaping
Problem 1) reshaping
In Hand et al (1994, p.7), a dataset which was taken from Snedecor and Cochran (1967,
p.347), contains the weight gain in rats. Hand et al give this description,
“The data come from an experiment to study the gain in weight of rats fed on four
different diets, distinguished by amount of protein (low and high) and by source of
protein (beef and cereal). The design of the experiment is completely randomized
with ten rats on each of the four treatments (which have a complete factorial
structure).”
Cut-and-paste the following into the do-file editor, highlight, and execute it to set up the
dataset.
clear
input gain1 gain2 gain3 gain4
90  73 107  98
76 102  95  74
90 118  97  56
64 104  80 111
86  81  98  95
51 107  74  88
72 100  74  82
90  87  67  77
95 117  89  86
78 111  58  92
end
These data represent the weight gain in rats for four groups:
gain1 = beef low
gain2 = beef high
gain3 = cereal low
gain4 = cereal high
To analyze these data with independent group t-tests, we need a variable for group which
contains the numbers 1 to 4, and then a variable for weight gain. That is, we need a long
format structure.
Convert these data from the present wide structure to a long structure, creating variables
called group and gain. To make this work, you will need an identification number that
uniquely defines the rows. You could use:
gen tempid = _n
The following would work,
gen tempid = _n
reshape long gain , i(tempid) j(group)
drop tempid
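What the reshape does can be pictured with a small Python sketch: each wide row with gain1 to gain4 becomes four long rows, one per group. The function name `reshape_long` is hypothetical, and only two rows of data are used to keep the illustration short.

```python
def reshape_long(rows, stub, groups):
    """Turn wide rows (stub1, stub2, ...) into long rows with a group and a stub column."""
    long_rows = []
    for row in rows:
        for g in groups:
            # The numeric suffix of the wide variable becomes the group value
            long_rows.append({"group": g, stub: row[f"{stub}{g}"]})
    return long_rows

# Two illustrative wide rows
wide = [
    {"gain1": 90, "gain2": 73, "gain3": 107, "gain4": 98},
    {"gain1": 76, "gain2": 102, "gain3": 95, "gain4": 74},
]
long_rows = reshape_long(wide, stub="gain", groups=[1, 2, 3, 4])
for row in long_rows:
    print(row)
```

Two wide rows yield eight long rows, so the full dataset of ten rows yields forty.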
Problem 2) Value labels
Assign the following value labels to the variable group:
1 = beef low
2 = beef high
3 = cereal low
4 = cereal high
You can refer to Chapter 1-5 to recall how to do this.
The following would work,
label define grouplab 1 "1) beef low" 2 "2) beef high" ///
3 "3) cereal low" 4 "4) cereal high"
label values group grouplab
tab group
Problem 3) All possible t-tests
Now, we want to compute all the possible t-tests (1 vs 2)(1 vs 3)(1 vs 4)(2 vs 3)(2 vs 4)
(3 vs 4). (Note: In an actual data analysis, you would then most likely apply a multiple
comparison procedure to the p values, as described in Chapter 2-8, but that is not part of
this problem.) For the groups 1 and 2 comparison, we will need,
ttest gain if group==1 | group==2, by(group)
Rather than putting six t-test lines in the do-file, see if you can do it with two for loops.
We did something quite close to this in Chapter 1-7, page 7, to create an upper triangular
matrix
* -- multiplication table (upper triangular matrix): attempt 2
* r = row , c = col
forvalues r = 1/3 {
    local m=(`r'-1)*2
    display _skip(`m') _continue
    forvalues c = `r'/3 {
        display `r'*`c' " " _continue
    }
    display // display nothing goes to next line
}

1 2 3
  4 6
    9

Hint: If you get the following error:
1 group found, 2 required
r(420);
it means you are trying to use the same group twice in the t-test—the two groups must be
different.
If you tried the following, which is a good first attempt,
forvalues i = 1/3 {
forvalues j = `i'/4 {
ttest gain if group==`i' | group==`j' , by(group)
}
}
*
you would get the following error:
1 group found, 2 required
r(420);
This error would come from trying to do a t-test between groups 1 and 1. The j counter
needs to always be at least one larger than the i counter.
If you tried the following, which is also a good attempt,
forvalues i = 1/3 {
forvalues j = 2/4 {
ttest gain if group==`i' | group==`j' , by(group)
}
}
*
you would get three t-tests and then that same error:
1 group found, 2 required
r(420);
which could come from trying to do a t-test between groups 2 and 2.
Here is a solution that works,
forvalues i = 1/3 {
local jstart = `i'+1
forvalues j = `jstart'/4 {
ttest gain if group==`i' | group==`j' , by(group)
}
}
*
It is true, just putting the six t-test lines in would have been faster. This solution ended
up taking six lines, anyway. But if you are geeky enough, this was a fun challenge.
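The index pattern of the working solution (the inner counter always starts one above the outer counter) can be checked in Python. This sketch only lists the pairs of groups that would be compared; it does not run any t-tests, and the function name `pairs_by_nested_loops` is hypothetical.

```python
from itertools import combinations

def pairs_by_nested_loops(n_groups):
    """List every (i, j) with i < j, mirroring the two forvalues loops."""
    pairs = []
    for i in range(1, n_groups):              # i = 1 .. n_groups-1
        for j in range(i + 1, n_groups + 1):  # j starts at i+1, so j is never equal to i
            pairs.append((i, j))
    return pairs

pairs = pairs_by_nested_loops(4)
print(pairs)
```

The result agrees with `itertools.combinations(range(1, 5), 2)`, which is the standard-library way to produce the same six pairs.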
Chapter 1-8. Operators, ifs, dates, and times
Problem 1) numeric operators
A frequent research problem is computing BMI using the following formula:
body mass index (BMI) = weight/height^2 (units: kg/m^2)
If the data are height in inches and weight in pounds, then a conversion is first needed.
Cut-and-paste the following into the do-file editor, highlight and execute it to set up the
dataset.
clear
input heightin weightlbs
56 200
63 240
50 125
60 127
48 150
57 180
58 210
56 185
60 220
61 310
60 180
59 175
60 100
61 98
59 90
58 80
58 145
65 150
end
lab var heightin "height (inches)"
lab var weightlbs "weight (pounds)"
To use the BMI formula, we need height in meters. The conversion formula is:
1 inch = 0.0254 meter
We also need weight in kilograms. The conversion formula is:
1 pound = 0.4536 kilogram
Generate a BMI variable.
We could do it in three steps, creating the intermediate variables and then BMI:
gen heightm = heightin*0.0254 // convert from inches to meters
gen weightkg = weightlbs*0.4536 // convert from pounds to kg
gen bmi = weightkg/heightm^2
Or, it could be done on one line,
gen bmi = (weightlbs*0.4536)/(heightin*0.0254)^2
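As a quick arithmetic check outside Stata, the conversion can be sketched in Python using the same two conversion factors. The function name `bmi_from_imperial` is hypothetical; the example uses the first row of the dataset (56 inches, 200 pounds).

```python
INCH_TO_M = 0.0254   # 1 inch = 0.0254 meter
LB_TO_KG = 0.4536    # 1 pound = 0.4536 kilogram

def bmi_from_imperial(height_in, weight_lb):
    """Compute BMI (kg/m^2) from height in inches and weight in pounds."""
    height_m = height_in * INCH_TO_M
    weight_kg = weight_lb * LB_TO_KG
    return weight_kg / height_m ** 2

first_row_bmi = bmi_from_imperial(56, 200)
print(round(first_row_bmi, 2))
```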
Problem 2) BMI categories
Cut-and-paste the following into the do-file editor, highlight and execute it to set up the
dataset.
clear
input id bmi
1 17.44444
2 18.49999
3 18.50000
4 18.50001
5 18.99999
6 24.00000
7 25.00000
8 25.00001
9 .
10 27.33333
11 29.99999
12 30.00000
13 32.44444
end
list
BMI categories recommended by the National Heart, Lung, and Blood Institute
(1998)(Onyike et al., 2003) are:
underweight    (BMI <18.5)
normal weight  (BMI 18.5–24.9)
overweight     (BMI 25.0–29.9)
obese          (BMI ≥30)
Create a variable, bmicat, that has scores 1 to 4, representing the four BMI categories. A
recode command is the easiest way, which we will do in Problem 3, but first try it with a
generate command, to create the first category, followed by some replace commands for
the other three categories.
Hint: one of the replace commands might be,
replace bmicat = 2 if 18.5<= bmi & bmi<25.0
End with the following commands to check your work,
list bmi bmicat
bysort bmicat: sum bmi
A first attempt might have been,
capture drop bmicat
gen bmicat = 1 if bmi<18.5
replace bmicat = 2 if 18.5<= bmi & bmi<25.0
replace bmicat = 3 if 25.0<= bmi & bmi<30.0
replace bmicat = 4 if bmi>=30
list bmi bmicat
bysort bmicat: sum bmi
but it would not have dealt with the missing value correctly. Missing data are stored as a
very large number (plus infinity if you want to think of it like that). Therefore,
(bmi>=30) evaluates to true, so a 4 is assigned.

     +-------------------+
     |      bmi   bmicat |
     |-------------------|
  1. | 17.44444        1 |
  2. | 18.49999        1 |
  3. |     18.5        2 |
  4. | 18.50001        2 |
  5. | 18.99999        2 |
     |-------------------|
  6. |       24        2 |
  7. |       25        3 |
  8. | 25.00001        3 |
  9. |        .        4 |  <- wrong answer
 10. | 27.33333        3 |
     |-------------------|
 11. | 29.99999        3 |
 12. |       30        4 |
 13. | 32.44444        4 |
     +-------------------+

The following will work,
capture drop bmicat
gen bmicat = 1 if bmi<18.5
replace bmicat = 2 if 18.5<= bmi & bmi<25.0
replace bmicat = 3 if 25.0<= bmi & bmi<30.0
replace bmicat = 4 if bmi>=30 & bmi~=.
list bmi bmicat
bysort bmicat: sum bmi

     +-------------------+
     |      bmi   bmicat |
     |-------------------|
  1. | 17.44444        1 |
  2. | 18.49999        1 |
  3. |     18.5        2 |
  4. | 18.50001        2 |
  5. | 18.99999        2 |
     |-------------------|
  6. |       24        2 |
  7. |       25        3 |
  8. | 25.00001        3 |
  9. |        .        . |  <- right answer
 10. | 27.33333        3 |
     |-------------------|
 11. | 29.99999        3 |
 12. |       30        4 |
 13. | 32.44444        4 |
     +-------------------+

-> bmicat = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         bmi |         2    17.97222    .7463863   17.44444   18.49999

-> bmicat = 2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         bmi |         4          20    2.677062       18.5         24

-> bmicat = 3

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         bmi |         4    26.83333    2.380469         25   29.99999

-> bmicat = 4

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         bmi |         2    31.22222    1.728479         30   32.44444

-> bmicat = .

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         bmi |         0

Notice that the minimums and maximums agree with the category endpoints.
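The missing-value pitfall can be illustrated in Python. In this sketch, `None` plays the role of a missing value and is tested first, so it never reaches the numeric comparisons; the constant `STATA_MISSING` is only a stand-in for the idea that missing sorts above every number, and both names are hypothetical.

```python
def bmi_category(bmi):
    """Return BMI category 1-4, or None when bmi is missing."""
    if bmi is None:
        return None
    if bmi < 18.5:
        return 1
    if bmi < 25.0:
        return 2
    if bmi < 30.0:
        return 3
    return 4

values = [17.44444, 18.49999, 18.5, 18.50001, 18.99999, 24.0, 25.0,
          25.00001, None, 27.33333, 29.99999, 30.0, 32.44444]
categories = [bmi_category(v) for v in values]
print(categories)

# Why the first attempt failed: a missing value that behaves like +infinity
# satisfies bmi >= 30 and is wrongly placed in category 4.
STATA_MISSING = float("inf")
naive_category = 4 if STATA_MISSING >= 30 else None
print(naive_category)
```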
Problem 3) BMI categories
Use the Problem 2) dataset, again. This time, create the BMI categories using a recode
statement.
Here are the categories, again.
underweight    (BMI <18.5)
normal weight  (BMI 18.5–24.9)
overweight     (BMI 25.0–29.9)
obese          (BMI ≥30)
Hint: a recode for categorizing age might be,
capture drop agecat
recode age 90/max=5 80/90=4 70/80=3 60/70=2 min/60=1 , gen(agecat)
list age agecat // check our work
bysort agecat: sum age // check our work
Forming the categories in reverse order solves the problem of decimal places. For
example, using “90/max=5” assigns 90 to 5, so when “80/90=4” comes up next, it
effectively translates into “80/89.999999=4”. That is, once a number in a range is
assigned, it stays assigned. Also, since we did not assign missing, using “.=99”, for
example, it remains missing (remains unassigned).
The following would work,
capture drop bmicat
recode bmi 30/max=4 25/30=3 18.5/25=2 min/18.5=1 ,gen(bmicat)
list bmi bmicat
bysort bmicat: sum bmi
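The “once assigned, it stays assigned” behavior amounts to a first-match-wins rule over closed intervals. The Python sketch below imitates that rule to show why the reverse ordering puts the boundary values 18.5, 25, and 30 in the intended categories. It is an illustration under that assumption, not Stata's recode implementation, and the names `recode` and `bmi_rules` are hypothetical.

```python
INF = float("inf")

def recode(value, rules):
    """Apply (low, high, code) rules in order; the first closed interval that matches wins."""
    if value is None:          # missing stays missing
        return None
    for low, high, code in rules:
        if low <= value <= high:
            return code
    return value               # values matching no rule are left unchanged

# Reverse order, as in the solution: 30/max=4 25/30=3 18.5/25=2 min/18.5=1
bmi_rules = [(30, INF, 4), (25, 30, 3), (18.5, 25, 2), (-INF, 18.5, 1)]
# Ascending order, for contrast
ascending_rules = list(reversed(bmi_rules))

print([recode(v, bmi_rules) for v in (18.5, 25, 30)])
print([recode(v, ascending_rules) for v in (18.5, 25, 30)])
```

With the ascending order, each boundary value is captured by the lower category, which is not what the BMI definitions call for.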
Problem 4) Concatenating (combining) strings
In a study where upper respiratory disease symptoms were abstracted from the patient
medical record, the investigator wanted to create a string that showed the combination of
symptoms.
Cut-and-paste the following into the do-file editor, highlight and execute it to set up the
dataset and create the symptoms string variable.
clear
input runnynose cough congestion throatpain earpain fever
1 1 1 1 1 1
1 0 1 1 0 0
0 1 0 0 1 1
0 0 1 0 1 0
0 0 0 1 0 0
1 0 0 0 0 0
0 0 0 0 0 1
0 0 0 0 0 0
0 0 0 0 1 0
end
*
* -- create a variable showing all symptoms found in medical record
* run = runnynose
* cou = cough
* con = congestion
* thr = throat pain
* ear = ear pain
* fev = fever
capture drop symptoms
gen str20 symptoms = ""
replace symptoms = "run," if runnynose==1
replace symptoms = "...," if runnynose==0
replace symptoms=symptoms+"cou," if cough==1
replace symptoms=symptoms+"...," if cough==0
replace symptoms=symptoms+"con," if congestion==1
replace symptoms=symptoms+"...," if congestion==0
replace symptoms=symptoms+"thr," if throatpain==1
replace symptoms=symptoms+"...," if throatpain==0
replace symptoms=symptoms+"ear," if earpain==1
replace symptoms=symptoms+"...," if earpain==0
replace symptoms=symptoms+"fev" if fever==1
replace symptoms=symptoms+"..." if fever==0
tab symptoms

               symptoms |      Freq.     Percent        Cum.
------------------------+-----------------------------------
...,...,...,...,...,... |          1       11.11       11.11
...,...,...,...,...,fev |          1       11.11       22.22
...,...,...,...,ear,... |          1       11.11       33.33
...,...,...,thr,...,... |          1       11.11       44.44
...,...,con,...,ear,... |          1       11.11       55.56
...,cou,...,...,ear,fev |          1       11.11       66.67
run,...,...,...,...,... |          1       11.11       77.78
run,...,con,thr,...,... |          1       11.11       88.89
run,cou,con,thr,ear,fev |          1       11.11      100.00
------------------------+-----------------------------------
                  Total |          9      100.00

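The string-building pattern can be summarized in a few lines of Python: each indicator contributes either its three-letter code or a "..." placeholder, and the pieces are joined with commas. The function name `combo_string` is hypothetical, and the sketch assumes every indicator is coded 0 or 1 (it raises an error otherwise).

```python
def combo_string(flags, codes):
    """Build a fixed-position string such as 'run,...,con,thr,...,...' from 0/1 indicators."""
    parts = []
    for name, code in codes:
        if flags[name] == 1:
            parts.append(code)
        elif flags[name] == 0:
            parts.append("...")
        else:
            raise ValueError(f"{name} must be 0 or 1")
    return ",".join(parts)

symptom_codes = [("runnynose", "run"), ("cough", "cou"), ("congestion", "con"),
                 ("throatpain", "thr"), ("earpain", "ear"), ("fever", "fev")]

# Second patient in the dataset: 1 0 1 1 0 0
patient = {"runnynose": 1, "cough": 0, "congestion": 1,
           "throatpain": 1, "earpain": 0, "fever": 0}
symptoms = combo_string(patient, symptom_codes)
print(symptoms)
```

Because every position is always filled, the strings line up in the frequency table and sort in a predictable order.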
In knee surgery, when an artificial vein is implanted to replace a damaged vein, three
popular drugs to prevent clotting are coumadin, plavix, and aspirin (ASA), either
separately or in combination.
The researcher wants to see a frequency table of these combinations in an easy to read
format. Cut-and-paste the following into the do-file editor, highlight and execute it to set
up the dataset. Then, create a drug combination string variable similar to the symptoms
variable just demonstrated.
clear
input coumadin plavix aspirin
1 0 0
1 1 0
0 1 0
0 0 0
. . .
0 0 1
0 1 1
1 1 1
end
list
The following would work,
* -- create a variable showing all drugs found in chart review
* cou = coumadin
* plv = plavix
* asa = aspirin
capture drop drugs
gen str11 drugs = ""
replace drugs = "cou," if coumadin==1
replace drugs = "...," if coumadin==0
replace drugs=drugs+"plv," if plavix==1
replace drugs=drugs+"...," if plavix==0
* no trailing comma after the last drug, so the string fits in str11
replace drugs=drugs+"asa" if aspirin==1
replace drugs=drugs+"..." if aspirin==0
list
tab drugs, missing
Problem 5) Hierarchical combinations
Returning to the knee surgery study in Problem 4), the investigator wants to create a drug
group variable. A decision was made that a combination is a more correct category than a
single drug. So, the patient would be assigned to the combination group, rather than the
single drug. In this way, a patient can only belong to one group. Set up the Problem 4
dataset again, and then create a drug group variable with the following categories:
1 = ASA + coumadin + plavix
2 = coumadin + plavix
3 = ASA + coumadin
4 = ASA + plavix
5 = coumadin
6 = plavix
7 = ASA
8 = no drug
Be sure to check your work.
On a first attempt, you might try something like,
capture drop group
gen group=1 if aspirin==1 & coumadin==1 & plavix==1
replace group=2 if coumadin==1 & plavix==1
replace group=3 if aspirin==1 & coumadin==1
replace group=4 if aspirin==1 & plavix==1
replace group=5 if coumadin==1
replace group=6 if plavix==1
replace group=7 if aspirin==1
replace group=8 if aspirin==0 & coumadin==0 & plavix==0
list

     +-------------------------------------+
     | coumadin   plavix   aspirin   group |
     |-------------------------------------|
  1. |        1        0         0       5 |
  2. |        1        1         0       6 |  <- wrong answer
  3. |        0        1         0       6 |
  4. |        0        0         0       8 |
  5. |        .        .         .       . |
     |-------------------------------------|
  6. |        0        0         1       7 |
  7. |        0        1         1       7 |  <- wrong answer
  8. |        1        1         1       7 |  <- wrong answer
     +-------------------------------------+

The reason that does not work is that the group can be reassigned as you go along.
In a really large dataset, where it is unfeasible to check the classification by listing the
data, you can use a frequency table after each line to make sure that nothing gets
reclassified. With each replace command, assign a category only if a category has not
already been assigned. The following would work,
capture drop group
gen group=1 if aspirin==1 & coumadin==1 & plavix==1
tab group
replace group=2 if group==. & coumadin==1 & plavix==1
tab group
replace group=3 if group==. & aspirin==1 & coumadin==1
tab group
replace group=4 if group==. & aspirin==1 & plavix==1
tab group
replace group=5 if group==. & coumadin==1
tab group
replace group=6 if group==. & plavix==1
tab group
replace group=7 if group==. & aspirin==1
tab group
replace group=8 if group==. & aspirin==0 & coumadin==0 & plavix==0
tab group
list

     +-------------------------------------+
     | coumadin   plavix   aspirin   group |
     |-------------------------------------|
  1. |        1        0         0       5 |
  2. |        1        1         0       2 |
  3. |        0        1         0       6 |
  4. |        0        0         0       8 |
  5. |        .        .         .       . |
     |-------------------------------------|
  6. |        0        0         1       7 |
  7. |        0        1         1       4 |
  8. |        1        1         1       1 |
     +-------------------------------------+

This time, all classifications are correct.
1 = ASA + coumadin + plavix
2 = coumadin + plavix
3 = ASA + coumadin
4 = ASA + plavix
5 = coumadin
6 = plavix
7 = ASA
8 = no drug
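The hierarchy is a first-match-wins classification: test the most specific combination first and stop as soon as one applies. A Python sketch of that logic follows, with `None` standing for missing and a hypothetical function name `drug_group`. The early `return` plays the same role as the `group==.` condition, since a patient who is already classified is never reclassified.

```python
def drug_group(coumadin, plavix, aspirin):
    """Assign drug group 1-8 by the hierarchy; None if any indicator is missing."""
    if None in (coumadin, plavix, aspirin):
        return None
    if aspirin == 1 and coumadin == 1 and plavix == 1:
        return 1
    if coumadin == 1 and plavix == 1:
        return 2
    if aspirin == 1 and coumadin == 1:
        return 3
    if aspirin == 1 and plavix == 1:
        return 4
    if coumadin == 1:
        return 5
    if plavix == 1:
        return 6
    if aspirin == 1:
        return 7
    return 8

# (coumadin, plavix, aspirin) for the eight patients, in dataset order
patients = [(1, 0, 0), (1, 1, 0), (0, 1, 0), (0, 0, 0),
            (None, None, None), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
groups = [drug_group(*p) for p in patients]
print(groups)
```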
Problem 6) Dates
Cut-and-paste the following into the do-file editor, highlight and execute it to set up the
dataset.
clear
input str10 visit1date str10 visit2date
"03/04/2000" "03/10/2000"
"05/06/2001" "05/15/2001"
end
list
Create a variable that represents the number of days between the two visits.
The following would work,
gen date1 = date(visit1date,"MDY")
gen date2 = date(visit2date,"MDY")
gen followupdays = date2 - date1
list

     +----------------------------------------------------+
     | visit1date   visit2date   date1   date2   follow~s |
     |----------------------------------------------------|
  1. | 03/04/2000   03/10/2000   14673   14679          6 |
  2. | 05/06/2001   05/15/2001   15101   15110          9 |
     +----------------------------------------------------+

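The date values in the listing are day counts from 1 January 1960, which is how Stata stores daily dates. The Python sketch below reproduces both the stored values and the follow-up days; the helper names `parse_mdy` and `stata_daily` are hypothetical.

```python
from datetime import date, datetime

STATA_EPOCH = date(1960, 1, 1)  # day 0 of Stata's daily date scale

def parse_mdy(text):
    """Parse a month/day/year string such as '03/04/2000'."""
    return datetime.strptime(text, "%m/%d/%Y").date()

def stata_daily(d):
    """Number of days from 1 January 1960 to d."""
    return (d - STATA_EPOCH).days

visits = [("03/04/2000", "03/10/2000"), ("05/06/2001", "05/15/2001")]
followup_days = [(parse_mdy(second) - parse_mdy(first)).days
                 for first, second in visits]
print(followup_days)
print([stata_daily(parse_mdy(first)) for first, _ in visits])
```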
Chapter 1-9. More graphics: popular scientific graphs
Chapter 1-10. Programming Stata
Chapter 1-11. Compilation of frequently used variable generation and modifying
commands
Chapter 1-12. Stata results into Excel & Word
References
Anker SD, Colet JC, Filippatos G, et al. (2009). Ferric carboxymaltose in patients with heart
failure and iron deficiency. N Engl J Med 361(25):2436-48.
Hand DJ, Daly F, Lunn AD, McConway KJ, Ostrowski E, editors. (1994). A Handbook of
Small Data Sets. New York, Chapman & Hall.
Onyike CU, Crum RM, Lee HB, Lyketsos CG, Eaton WW. (2003). Is obesity associated with
major depression? Results from the third national health and nutrition examination
survey. Am J Epidemiol 158(12):1139-1153.
Snedecor GW, Cochran WG. (1967). Statistical Methods, 6th ed, Ames, Iowa, Iowa State
University Press.