Lesson 4 - Topics
• Creating new variables in the data step
• SAS Functions

Creating New Variables
• Direct assignments (formulas):
    c = a + b ;
    d = 2*a + 3*b + 7*c ;
    bmi = weight/(height*height);
• Indirect assignments (if/then/else):
    if age < 50 then young = 1;
    else young = 2;

    if income < 15 then tax = 1;
    else if income < 25 then tax = 2;
    else if income >=25 then tax = 3;

Direct Assignments (Formulas)
• Example
    c = a + b ;
  So if a = 2 and b = 3, then c = 5.
  What if a is missing, what is c? c will be missing.
  What if b is missing? c will again be missing.

If/then/else Statements
With if-then-else definitions SAS stops executing after the first true statement.
    if income < 15 then tax = 1;
    else if income < 25 then tax = 2;
    else if income >=25 then tax = 3;

    What if income is 10?        tax = 1
    What if income is 23?        tax = 2
    What if income is 30?        tax = 3
    What if income is missing?   tax = 1  (a missing value compares as less than any number)

Creating New Variables
Create a new variable with 2 levels, one for college graduates and one for non-college graduates.

Program 5
    DATA tdata;
    INFILE 'C:\SAS_Files\tomhs.data' ;
    INPUT @  1 ptid  $10.
          @ 49 educ    1.
          @123 sbp12   3. ;

    * New variable definitions go after the INPUT statement;

    * This way will code missing values to the value 2;
    if educ < 7 then grad1 = 2 ;
    else if educ >=7 then grad1 = 1 ;

    * The next two ways are equivalent and are correct;
    if educ < 7 and educ ne . then grad2 = 2;
    else if educ >=7 then grad2 = 1;

    * IN is a useful operator in SAS ;
    if educ IN(1,2,3,4,5,6) then grad3 = 2;
    else if educ IN(7,8,9) then grad3 = 1;

    PROC FREQ DATA=tdata;
    TABLES educ grad1 grad2 grad3 ;
    RUN;

OUTPUT
                                   Cumulative    Cumulative
    educ    Frequency    Percent    Frequency      Percent
    --------------------------------------------------------
       1         3         3.03          3           3.03
       3         4         4.04          7           7.07
       4        23        23.23         30          30.30
       5        14        14.14         44          44.44
       6        12        12.12         56          56.57
       7        16        16.16         72          72.73
       8        10        10.10         82          82.83
       9        17        17.17         99         100.00
                     Frequency Missing = 1

                                   Cumulative    Cumulative
    grad1   Frequency    Percent    Frequency      Percent
    --------------------------------------------------------
       1        43        43.00         43          43.00
       2        57        57.00        100         100.00

                                   Cumulative    Cumulative
    grad2   Frequency    Percent    Frequency      Percent
    --------------------------------------------------------
       1        43        43.43         43          43.43
       2        56        56.57         99         100.00
                     Frequency Missing = 1

                                   Cumulative    Cumulative
    grad3   Frequency    Percent    Frequency      Percent
    --------------------------------------------------------
       1        43        43.43         43          43.43
       2        56        56.57         99         100.00
                     Frequency Missing = 1

grad1 has no missing values: the missing value for educ was coded to 2.

    PROC FREQ DATA=tdata;
    TABLES educ*grad1 /MISSING NOCUM NOPERCENT NOROW NOCOL;
    TITLE 'Use Crosstabulation to Verify Recoding';
    RUN;

    Table of educ by grad1

    educ        grad1
    Frequency |      1 |      2 |  Total
    ----------+--------+--------+
            . |      0 |      1 |      1
            1 |      0 |      3 |      3
            3 |      0 |      4 |      4
            4 |      0 |     23 |     23
            5 |      0 |     14 |     14
            6 |      0 |     12 |     12
            7 |     16 |      0 |     16
            8 |     10 |      0 |     10
            9 |     17 |      0 |     17
    ----------+--------+--------+
    Total           43       57      100

This shows that the missing value for educ got assigned a value of 2.

    * Recode sbp12 into 3 levels;
    if sbp12 = .   then sbp12c = . ; else
    if sbp12 < 120 then sbp12c = 1 ; else
    if sbp12 < 140 then sbp12c = 2 ; else
    if sbp12 >=140 then sbp12c = 3 ;

With if-then-else definitions SAS stops executing after the first true statement.
• Values < 120 will be assigned a value of 1
• Values 120-139 will be assigned a value of 2
• Values >=140 will be assigned a value of 3
• Missing values will be assigned to missing

    PROC FREQ DATA=tdata;
    TABLES sbp12c sbp12;
    RUN;

OUTPUT
                                   Cumulative    Cumulative
    sbp12c  Frequency    Percent    Frequency      Percent
    --------------------------------------------------------
       1        36        39.13         36          39.13
       2        43        46.74         79          85.87
       3        13        14.13         92         100.00
                     Frequency Missing = 8

                                   Cumulative    Cumulative
    sbp12   Frequency    Percent    Frequency      Percent
    --------------------------------------------------------
      93         1         1.09          1           1.09
      94         1         1.09          2           2.17
     101         1         1.09          3           3.26
     104         1         1.09          4           4.35
     105         1         1.09          5           5.43
     (more values)
     147         1         1.09         87          94.57
     148         1         1.09         88          95.65
     149         1         1.09         89          96.74
     153         1         1.09         90          97.83
     154         1         1.09         91          98.91
     158         1         1.09         92         100.00
                     Frequency Missing = 8

    * Easy but costly error to make;
    if sbp12 = .   then sbp12c = . ; else
    if sbp12 < 120 then sbp12c = 1 ; else
    if sbp12 < 140 then sbp12  = 2 ; else
    if sbp12 >=140 then sbp12c = 3 ;

    PROC FREQ DATA=tdata;
    TABLES sbp12c;
    RUN;

How come no values of 2 and why so many missing?

    The FREQ Procedure
                                   Cumulative    Cumulative
    sbp12c  Frequency    Percent    Frequency      Percent
    --------------------------------------------------------
       1        36        73.47         36          73.47
       3        13        26.53         49         100.00
                     Frequency Missing = 51

The third statement assigns the value 2 to sbp12 instead of sbp12c. The 43 observations with sbp12 between 120 and 139 therefore never receive a value for sbp12c and stay missing (43 + 8 = 51), and their original sbp12 values are overwritten with 2.

Important Facts When Creating New Variables
1. New variables are initialized to missing
2. Missing values are < any value
       if var < value   (true if var is missing)
3. Reference missing values for numeric variables as .
4. Reference missing values for character variables as ' '
       if sbp = . then ...      (or if missing(sbp))
       if clinic = ' ' then ...
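
The facts above can be checked on a tiny dataset. The following is a minimal sketch, not part of the course programs: the dataset name demo, the variables grad_bad and grad_ok, and the three data values are all made up for illustration.

```sas
* Hypothetical demo data, values are invented for illustration;
DATA demo;
   INPUT educ;

   * Fact 2 - a missing educ satisfies educ < 7 so grad_bad becomes 2;
   IF educ < 7 THEN grad_bad = 2;
   ELSE grad_bad = 1;

   * Guarding with the MISSING function leaves grad_ok at its initial missing value (Fact 1);
   IF NOT MISSING(educ) THEN DO;
      IF educ < 7 THEN grad_ok = 2;
      ELSE grad_ok = 1;
   END;

   DATALINES;
5
8
.
;
RUN;

PROC PRINT DATA=demo;
RUN;
```

In the third observation educ is missing, so grad_bad should print as 2 while grad_ok prints as a period. Testing for missing first, rather than adding "and educ ne ." to each condition, keeps the check in one place when several recodes depend on the same variable.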
SAS Handling of Missing Data When Creating New Variables
• Direct assignments (formulas):
    c = a + b ;
    d = 2*a + 3*b + 7*c ;
    bmi = weight/(height*height);
  If any variable on the right-hand side is missing then the new variable will be missing.
• Indirect assignments:
    if age < 50 then young = 1;
    else young = 2;
  New variables are initialized to missing and are given a value only when one of the IF conditions is true.

What Value to Set New Variable
    if age < 20 then teenager = 1;
    else if age >=20 then teenager = 2;

    if age < 20 then teenager = 1;
    else if age >=20 then teenager = 0;

    if age < 20 then teenager = 'YES';
    else if age >=20 then teenager = 'NO';

    * Program 6 SAS Functions ;
    DATA example;
    INFILE 'C:\SAS_Files\tomhs.data' ;
    INPUT @058 height 4.1
          @085 weight 5.1
          @172 ursod  3.
          @236 (se1-se10) (1.0 +1);

    bmi    = (weight*703.0768)/(height*height);
    rbmi1  = ROUND(bmi,1);
    lursod = LOG(ursod);
    seavg  = MEAN (OF se1-se10);
    semax  = MAX (OF se1-se10);
    semin  = MIN (OF se1-se10);
    RUN;

    * Use of dash notation ;
    seavg = MEAN (OF se1-se10);

This is the same as
    seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10);
The OF is very important. Otherwise SAS thinks you are subtracting se10 from se1.
To use this notation the ROOT of the name must be the same.

    * Two ways of computing average ;
    seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10);
Versus
    seavg = (se1+se2+se3+se4+se5+se6+se7+se8+se9+se10)/10;

• The MEAN function computes the average of the nonmissing values. The result is missing only if all values are missing.
• The + formula requires all values to be nonmissing; otherwise the result will be missing.

    if N(of se1-se10) > 5 then seavg = MEAN(of se1-se10);
What does this statement do?
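
The N function counts the nonmissing arguments, so the statement above computes seavg only when more than 5 of the 10 side effect scores are present; otherwise seavg keeps its initial missing value. A small sketch of this, assuming a made-up dataset sedemo, a helper variable nvalid, and invented scores (none of these come from the TOMHS file):

```sas
* Hypothetical demo data, scores are invented for illustration;
DATA sedemo;
   INPUT se1-se10;

   * Number of nonmissing side effect scores in this observation;
   nvalid = N(OF se1-se10);

   * Average only when more than 5 of the 10 scores are present;
   IF nvalid > 5 THEN seavg = MEAN(OF se1-se10);

   DATALINES;
1 1 2 1 1 1 1 3 1 1
1 . 2 . 1 . 1 1 1 2
1 . . . . 2 . . . .
;
RUN;

PROC PRINT DATA=sedemo;
   VAR nvalid seavg;
RUN;
```

The first observation has all 10 scores (nvalid = 10, seavg = 1.3), the second has 7 and still gets an average, and the third has only 2, so seavg is left missing. Requiring a minimum count avoids reporting an average that rests on only one or two items.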
    PROC PRINT DATA = example (OBS=10);
    VAR bmi rbmi1 seavg semin semax ;
    TITLE 'Listing of Selected Data for 10 Patients ';
    RUN;

    PROC FREQ DATA = example;
    TABLES semax;
    TITLE 'Distribution of Worst Side Effect Value';
    TITLE2 'Side Effect Scores Range from 1 to 4';
    RUN;

    ods graphics on;
    PROC UNIVARIATE DATA = example ;
    VAR ursod lursod;
    QQPLOT ursod lursod;
    TITLE 'Quantile Plots for Urine Sodium Data';
    RUN;

OUTPUT
    Listing of Selected Data for 10 Patients

    Obs       bmi    rbmi1    seavg    semin    semax
      1   28.2620      28      1.1       1        2
      2   35.9963      36      1.0       1        1
      3   27.0489      27      1.0       1        1
      4   28.2620      28      1.1       1        2
      5   33.2008      33      1.0       1        1
      6   27.7691      28      1.2       1        2
      7   32.6040      33      1.0       1        1
      8   22.4057      22      1.2       1        2
      9   37.2037      37      1.1       1        2
     10   33.1717      33      1.7       1        3

    Distribution of Worst Side Effect Value
    Side Effect Scores Range from 1 to 4

    The FREQ Procedure
                                   Cumulative    Cumulative
    semax   Frequency    Percent    Frequency      Percent
    --------------------------------------------------------
       1        33        33.00         33          33.00
       2        52        52.00         85          85.00
       3        13        13.00         98          98.00
       4         2         2.00        100         100.00

2 patients had at least 1 severe side effect.

Quantile plots for urine sodium: the log transformed value (lursod) shows a better linear pattern.