Lesson 5 - Topics

• Creating new variables in the data step
• SAS Functions
• Programs 5-6 in course notes
• LSB 3:1-6,11-12

Creating New Variables

• Direct assignments (formulas):

    c = a + b ;
    d = 2*a + 3*b + 7*c ;
    bmi = weight/(height*height);

• Indirect assignments (if/then/else):

    if age < 50 then young = 1;
    else young = 2;

    if income < 15 then tax = 1;
    else if income < 25 then tax = 2;
    else if income >= 25 then tax = 3;

Direct Assignments (Formulas)

• Example

    c = a + b ;

So if a = 2 and b = 3, then c = 5.
What if a is missing, what is c? c will be missing.
What if b is missing?

If/then/else Statements

With if-then-else definitions, SAS stops executing after the first true statement.

    if income < 15 then tax = 1;
    else if income < 25 then tax = 2;
    else if income >= 25 then tax = 3;

    What if income is 10?        Tax = 1
    What if income is 23?        Tax = 2
    What if income is 30?        Tax = 3
    What if income is missing?   Tax = 1

Creating New Variables

Create a new variable with 2 levels, one for college graduates and one for non-college graduates.

Program 5

    DATA tdata;
    INFILE 'C:\SAS_Files\tomhs.data' ;
    INPUT @1   ptid  $10.
          @49  educ  1.
          @123 sbp12 3. ;

    * New variable definitions go after the INPUT statement;

    * This way will code missing values to the value 2;
    if educ < 7 then grad1 = 2 ;
    else if educ >= 7 then grad1 = 1 ;

    * The next two ways are equivalent and are correct;
    if educ < 7 and educ ne . then grad2 = 2;
    else if educ >= 7 then grad2 = 1;

    * IN is a special operator in SAS ;
    if educ IN(1,2,3,4,5,6) then grad3 = 2;
    else if educ IN(7,8,9) then grad3 = 1;

    PROC FREQ DATA=tdata;
    TABLES educ grad1 grad2 grad3 ;
    RUN;

OUTPUT

                                   Cumulative    Cumulative
    educ    Frequency    Percent    Frequency       Percent
    --------------------------------------------------------
       1           3       3.03            3          3.03
       3           4       4.04            7          7.07
       4          23      23.23           30         30.30
       5          14      14.14           44         44.44
       6          12      12.12           56         56.57
       7          16      16.16           72         72.73
       8          10      10.10           82         82.83
       9          17      17.17           99        100.00

    Frequency Missing = 1

                                   Cumulative    Cumulative
    grad1   Frequency    Percent    Frequency       Percent
    --------------------------------------------------------
       1          43      43.00           43         43.00
       2          57      57.00          100        100.00

    (grad1 coded the missing value for educ to 2)

                                   Cumulative    Cumulative
    grad2   Frequency    Percent    Frequency       Percent
    --------------------------------------------------------
       1          43      43.43           43         43.43
       2          56      56.57           99        100.00

    Frequency Missing = 1

                                   Cumulative    Cumulative
    grad3   Frequency    Percent    Frequency       Percent
    --------------------------------------------------------
       1          43      43.43           43         43.43
       2          56      56.57           99        100.00

    Frequency Missing = 1

Use a crosstabulation to verify the recoding:

    PROC FREQ DATA=tdata;
    TABLES educ*grad1 / MISSING NOCUM NOPERCENT NOROW NOCOL;
    TITLE 'Use Crosstabulation to Verify Recoding';
    RUN;

    Table of educ by grad1

    educ        grad1
    Frequency |      1 |      2 |  Total
    ----------+--------+--------+-------
            . |      0 |      1 |      1
            1 |      0 |      3 |      3
            3 |      0 |      4 |      4
            4 |      0 |     23 |     23
            5 |      0 |     14 |     14
            6 |      0 |     12 |     12
            7 |     16 |      0 |     16
            8 |     10 |      0 |     10
            9 |     17 |      0 |     17
    ----------+--------+--------+-------
    Total           43       57      100

This shows that the missing value for educ was assigned a value of 2.

Recoding sbp12 into 3 levels

    * Recode sbp12 into 3 levels;
    if sbp12 = .    then sbp12c = . ; else
    if sbp12 < 120  then sbp12c = 1 ; else
    if sbp12 < 140  then sbp12c = 2 ; else
    if sbp12 >= 140 then sbp12c = 3 ;

With if-then-else definitions, SAS stops executing after the first true statement.

    Values < 120 will be assigned a value of 1
    Values 120-139 will be assigned a value of 2
    Values >= 140 will be assigned a value of 3
    Missing values will be assigned to missing

    PROC FREQ DATA=tdata;
    TABLES sbp12c sbp12;
    RUN;

OUTPUT

                                   Cumulative    Cumulative
    sbp12c  Frequency    Percent    Frequency       Percent
    --------------------------------------------------------
       1          36      39.13           36         39.13
       2          43      46.74           79         85.87
       3          13      14.13           92        100.00

    Frequency Missing = 8

                                   Cumulative    Cumulative
    sbp12   Frequency    Percent    Frequency       Percent
    --------------------------------------------------------
      93           1       1.09            1          1.09
      94           1       1.09            2          2.17
     101           1       1.09            3          3.26
     104           1       1.09            4          4.35
     105           1       1.09            5          5.43
     (more values)
     147           1       1.09           87         94.57
     148           1       1.09           88         95.65
     149           1       1.09           89         96.74
     153           1       1.09           90         97.83
     154           1       1.09           91         98.91
     158           1       1.09           92        100.00

    Frequency Missing = 8

An easy but costly error to make

    * Easy but costly error to make;
    if sbp12 = .    then sbp12c = . ; else
    if sbp12 < 120  then sbp12c = 1 ; else
    if sbp12 < 140  then sbp12  = 2 ; else
    if sbp12 >= 140 then sbp12c = 3 ;

    PROC FREQ DATA=tdata;
    TABLES sbp12c;
    RUN;

How come there are no values of 2, and why are so many missing?

    The FREQ Procedure

                                   Cumulative    Cumulative
    sbp12c  Frequency    Percent    Frequency       Percent
    --------------------------------------------------------
       1          36      73.47           36         73.47
       3          13      26.53           49        100.00

    Frequency Missing = 51

The third statement assigns 2 to sbp12 instead of sbp12c, so the 43 observations with values of 120-139 keep sbp12c at its initial missing value (8 + 43 = 51 missing).

Important Facts When Creating New Variables

1. New variables are initialized to missing.
2. Missing values are less than any value: if var < value is true when var is missing.
3. Reference missing values for numeric variables as .
4. Reference missing values for character variables as ' '

    if sbp = . then ...      (or if missing(sbp))
    if clinic = ' ' then ...
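The facts above about missing values in comparisons can be checked with a small self-contained data step. This is only a sketch under stated assumptions: the data set name demo, the variables tax_unsafe and tax_safe, and the four inline income values are made up for illustration and are not part of the course data.

```sas
* Hypothetical demo data set: four incomes, the last one missing;
DATA demo;
   INPUT income;

   * Unsafe recode: a missing income satisfies income < 15;
   IF income < 15 THEN tax_unsafe = 1;
   ELSE IF income < 25 THEN tax_unsafe = 2;
   ELSE tax_unsafe = 3;

   * Safer recode: test for missing before any comparison;
   IF MISSING(income) THEN tax_safe = .;
   ELSE IF income < 15 THEN tax_safe = 1;
   ELSE IF income < 25 THEN tax_safe = 2;
   ELSE tax_safe = 3;

   DATALINES;
10
23
30
.
;
RUN;

PROC PRINT DATA=demo;
RUN;
```

In the listing, the row with the missing income should show tax_unsafe = 1 and tax_safe = . while the other three rows agree (1, 2, 3). Testing for missing first is one way to guard a recode; adding "and income ne ." to the first condition, as done for grad2 in Program 5, is another.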
SAS Handling of Missing Data When Creating New Variables

• Direct assignments (formulas):

    c = a + b ;
    d = 2*a + 3*b + 7*c ;
    bmi = weight/(height*height);

  If any variable on the right-hand side is missing, then the new variable will be missing.

• Indirect assignments:

    if age < 50 then young = 1;
    else young = 2;

  New variables are initialized to missing but may be given a value if any of the IF statements are true.

Checks you can make to be sure new variables are created correctly

1. Display original and new variables.

    PROC PRINT DATA=tdata (OBS=20);
    VAR educ college ;

2. Run PROC MEANS on original and new variable. Make sure both variables have the same number of missing values.

    PROC MEANS DATA=tdata;
    VAR educ college;

3. Run PROC FREQ on original and new variable.

    PROC FREQ DATA=tdata;
    TABLES educ college educ*college;

What Value to Set New Variable

    if age < 20 then teenager = 1;
    else if age >= 20 then teenager = 2;

    if age < 20 then teenager = 1;
    else if age >= 20 then teenager = 0;

    if age < 20 then teenager = 'YES';
    else if age >= 20 then teenager = 'NO';

If-then-do statements

    * Conditionally execute several statements;
    * Create indicator variables for race;
    * Make sure race variable not missing;
    if race ne . then do;
       white = 0;
       black = 0;
       asian = 0;
       other = 0;
       if race = 1 then white = 1;
       if race = 2 then black = 1;
       if race = 3 then asian = 1;
       if race = 4 then other = 1;
    end;

DO LOOPS WITH ARRAYS

- Used to shorten code
- Used when repeating the same code
- Used with a DO/END loop

    ARRAY wtlb(3) wt1 wt2 wt3;
    ARRAY wtkg(3) newwt1 newwt2 newwt3;

    DO index = 1 to 3;
       wtkg(index) = wtlb(index) / 2.2;
    END;

    /* same as the following code
       newwt1 = wt1 / 2.2;
       newwt2 = wt2 / 2.2;
       newwt3 = wt3 / 2.2;
    *************************************/

Program 6

    * Program 6 SAS Functions ;
    DATA example;
    INFILE 'C:\SAS_Files\tomhs.data' ;
    INPUT @058 height 4.1
          @085 weight 5.1
          @172 ursod  3.
          @236 (se1-se10) (1.0 + 1);

    bmi    = (weight*703.0768)/(height*height);
    rbmi1  = ROUND(bmi,1);
    rbmi2  = ROUND(bmi,.1);
    lursod = LOG(ursod);

    seavg = MEAN (OF se1-se10);
    semax = MAX  (OF se1-se10);
    semin = MIN  (OF se1-se10);

Use of dash notation

    * Use of dash notation ;
    seavg = MEAN (OF se1-se10);

This is the same as

    seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10);

The OF is very important. Otherwise SAS thinks you are subtracting se10 from se1. To use this notation, the ROOT of the names must be the same.

Two ways of computing an average

    * Two ways of computing average ;
    seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10);

versus

    seavg = (se1+se2+se3+se4+se5+se6+se7+se8+se9+se10)/10;

The MEAN function computes the average of the nonmissing values; the result is missing only if all values are missing. The + formula requires all values to be nonmissing; otherwise the result will be missing.

    if N(of se1-se10) > 5 then seavg = MEAN(of se1-se10);

What does this statement do?

Program 6 (continued)

    * Compute 10 new variables, 100 if se is present and 0 if not;
    * hse1-hse10 are the new variables;
    ARRAY se (10) se1-se10;
    ARRAY hse(10) hse1-hse10;

    DO senumber = 1 to 10;
       if se(senumber) = 1 then hse(senumber) = 0;
       else if se(senumber) in(2,3,4) then hse(senumber) = 100;
    END;

For senumber = 1 the code in the loop is

    if se1 = 1 then hse1 = 0;
    else if se1 in(2,3,4) then hse1 = 100;

The rest of the program displays the results:

    PROC PRINT DATA = example (OBS=10);
    VAR bmi rbmi1 rbmi2 seavg semin semax ;
    TITLE 'Listing of Selected Data for 10 Patients ';
    RUN;

    PROC FREQ DATA = example;
    TABLES semax;
    TITLE 'Distribution of Worse Side Effect Value';
    TITLE2 'Side Effect Scores Range from 1 to 4';
    RUN;

    PROC MEANS DATA = example;
    VAR hse1-hse10;
    TITLE 'Percent of Patients With Condition by Condition';
    RUN;

    ods graphics on;
    PROC UNIVARIATE DATA = example ;
    VAR ursod lursod;
    QQPLOT ursod lursod;
    TITLE 'Quantile Plots for Urine Sodium Data';
    RUN;

OUTPUT

    Listing of Selected Data for 10 Patients

    Obs        bmi    rbmi1    rbmi2    seavg    semin    semax
      1    28.2620       28     28.3      1.1        1        2
      2    35.9963       36     36.0      1.0        1        1
      3    27.0489       27     27.0      1.0        1        1
      4    28.2620       28     28.3      1.1        1        2
      5    33.2008       33     33.2      1.0        1        1
      6    27.7691       28     27.8      1.2        1        2
      7    32.6040       33     32.6      1.0        1        1
      8    22.4057       22     22.4      1.2        1        2
      9    37.2037       37     37.2      1.1        1        2
     10    33.1717       33     33.2      1.7        1        3

    Distribution of Worse Side Effect Value
    Side Effect Scores Range from 1 to 4

    The FREQ Procedure

                                   Cumulative    Cumulative
    semax   Frequency    Percent    Frequency       Percent
    --------------------------------------------------------
       1          33      33.00           33         33.00
       2          52      52.00           85         85.00
       3          13      13.00           98         98.00
       4           2       2.00          100        100.00

    2 patients had at least 1 severe side effect.

    Percent of Patients With Condition by Condition Type

    The MEANS Procedure

    Variable      N          Mean       Std Dev    Minimum        Maximum
    ---------------------------------------------------------------------
    hse1        100    12.0000000    32.6598632          0    100.0000000
    hse2        100    21.0000000    40.9360181          0    100.0000000
    hse3        100     8.0000000    27.2659924          0    100.0000000
    hse4        100    13.0000000    33.7997669          0    100.0000000
    hse5        100    10.0000000    30.1511345          0    100.0000000
    hse6        100    30.0000000    46.0566186          0    100.0000000
    hse7        100    16.0000000    36.8452949          0    100.0000000
    hse8        100    31.0000000    46.4823199          0    100.0000000
    hse9        100     7.0000000    25.6432400          0    100.0000000
    hse10       100    14.0000000    34.8735088          0    100.0000000

    These means are the percent of patients with each side effect.

In the quantile plots for urine sodium, the log-transformed value shows a more linear pattern.
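Returning to the two ways of computing an average and the N(of se1-se10) > 5 statement in Program 6, the sketch below compares them on a tiny made-up data set. Everything here is an assumption for illustration: the data set name avgdemo, the variables avg_mean, avg_sum, nvalid and avg_min2, the use of three scores instead of ten, and the threshold of at least 2 nonmissing values as a scaled-down analogue of "> 5".

```sas
* Hypothetical demo: three scores per row, with varying amounts of missing data;
DATA avgdemo;
   INPUT se1-se3;

   * MEAN averages the nonmissing values only;
   avg_mean = MEAN(OF se1-se3);

   * The + formula is missing if any one value is missing;
   avg_sum = (se1 + se2 + se3) / 3;

   * N counts the nonmissing values;
   nvalid = N(OF se1-se3);

   * Average only when enough values are present, otherwise leave missing;
   IF nvalid >= 2 THEN avg_min2 = MEAN(OF se1-se3);

   DATALINES;
1 2 3
1 . 3
. . 2
. . .
;
RUN;

PROC PRINT DATA=avgdemo;
RUN;
```

For the row 1 . 3, avg_mean is 2 while avg_sum is missing. For the row . . 2, avg_mean is still 2 but avg_min2 stays missing because only one score is present. In the same way, the statement in Program 6 computes seavg only for patients with more than 5 of the 10 side-effect scores recorded and leaves it missing otherwise.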