EPIB 698D Lecture 2
Raul Cruz
Spring 2013

SAS functions
• SAS has over 400 functions, with the following general form:
    Function-name(argument, argument, …)
• All functions must have parentheses, even if they do not require any arguments
• Example:
    X = int(log(10));
    Mean_score = mean(score1, score2, score3);
• The MEAN function returns the mean of the non-missing arguments. This differs from adding the arguments and dividing by their number, which returns a missing value if any argument is missing.

Common Functions and Operators
• Functions:
    ABS: absolute value
    EXP: exponential
    LOG: natural logarithm
    MAX and MIN: maximum and minimum
    SQRT: square root
    SUM: sum of variables
    Example: SUM(of x1-x10, x21)
• Arithmetic operators: +, -, *, /, ** (not ^)

More SAS functions
    Function   Example                                  Result
    Max        Y = Max(1, 3, 5);                        Y = 5
    Round      Y = Round(1.236, .01);                   Y = 1.24
    Sum        Y = Sum(1, 3, 5);                        Y = 9
    Length     a = 'my cat'; Y = Length(a);             Y = 6
    Trim       a = 'my '; b = 'cat'; Y = trim(a)||b;    Y = 'mycat'
• Note: the second argument of ROUND is the rounding unit (.01 rounds to two decimal places), not the number of decimal places.

Using IF-THEN statement
• The IF-THEN statement is used for conditional processing. Example: you want to compute mean test scores for female students but not for male students, so the mean is computed only when gender = 'F'.
• Syntax:
    If condition then action;
  Eg:
    If gender = 'F' then mean_score = mean(scr1, scr2);

Using IF-THEN statement: logical comparison operators
    Comparison                  Mnemonic   Symbol
    Equal to                    EQ         =
    Not equal to                NE         ^= or ~=
    Less than                   LT         <
    Less than or equal to       LE         <=
    Greater than                GT         >
    Greater than or equal to    GE         >=
    Equal to one in a list      IN
• Note: a missing numeric value is treated as smaller than any number, so it compares as the most negative value your computer can represent.

Using IF-THEN statement
• Example: we have data containing the following information on subjects:
    Age  Gender  Midterm  Quiz  FinalExam
    21   M       80       B-    82
    20   F       90       A     93
    35   M       87       B+    85
    48   F       80       C     76
    59   F       95       A+    97
    15   M       88       C     93
• Task: group the students by age (<20, [20-40), [40-60), >=60)

    data conditional;
      input Age Gender $ Midterm Quiz $2.
            FinalExam;
      datalines;
    21 M 80 B- 82
    20 F 90 A 93
    35 M 87 B+ 85
    48 F 80 C 76
    59 F 95 A+ 97
    15 M 88 C 93
    ;

    data new1;
      set conditional;
      if Age < 20 then AgeGroup = 1;
      if 20 <= Age < 40 then AgeGroup = 2;
      if 40 <= Age < 60 then AgeGroup = 3;
      if Age >= 60 then AgeGroup = 4;
    run;

Multiple conditions with AND and OR
• Syntax:
    IF condition1 and condition2 then action;
• Eg:
    If age < 40 and gender = 'F' then group = 1;
    If age < 40 or gender = 'F' then group = 2;

IF-THEN statement, multiple conditions
• Example: using the same student data (Age, Gender, Midterm, Quiz, FinalExam)
• Task: group the students by age (<40, >=40) and gender

    data new1;
      set conditional;
      if age < 40 and gender = 'F' then group = 1;
      if age >= 40 and gender = 'F' then group = 2;
      if age < 40 and gender = 'M' then group = 3;
      if age >= 40 and gender = 'M' then group = 4;
    run;

• Note: a missing numeric value is treated as smaller than any number, so it compares as the most negative value your computer can represent.
• Example: group age into age groups when some ages are missing
    21 M 80 B- 82
    20 F 90 A 93
    .  M 87 B+ 85
    48 F 80 C 76
    59 F 95 A+ 97
    .  M 88 C 93
• With a condition such as Age < 20, the two observations with missing Age satisfy the condition and fall into the youngest group unless missing values are handled explicitly.

IF-THEN statement, with multiple actions
• Example: using the same student data (Age, Gender, Midterm, Quiz, FinalExam)
• Task: group the students by age, and assign a test date based on the age group

Multiple actions with DO, END
• Syntax:
    IF condition then do;
      action1;
      action2;
    end;
• Eg:
    if age <= 20 then do;
      group = 1;
      exam_date = "Monday";
    end;

IF-THEN/ELSE statement
• Syntax:
    IF condition1 then action1;
    Else if condition2 then action2;
    Else if condition3 then action3;
• The IF-THEN/ELSE statement has two advantages over a series of IF-THEN statements:
  (1) It is more efficient and uses less computing time, because SAS stops checking conditions once one is true
  (2) The ELSE logic ensures that your groups are mutually exclusive, so that you do not put one observation into more than one group

IF-THEN/ELSE statement
    data new1;
      set conditional;
      if Age < 20 then AgeGroup = 1;
      else if Age >= 20 and Age < 40 then AgeGroup = 2;
      else if Age >= 40 and Age < 60 then AgeGroup = 3;
      else if Age >= 60 then AgeGroup = 4;
    run;

Subsetting your data
• You can subset your data using an IF statement in a DATA step
• Example: the following two DATA steps are equivalent; both keep only the observations with gender = 'F'
    Data new1;
      Set new;
      If gender = 'F';
    run;

    Data new1;
      Set new;
      If gender ^= 'F' then delete;
    run;

Stacking data sets using the SET statement
• With more than one data set, the SET statement stacks the data sets one on top of the other
• Syntax:
    DATA new-data-set;
      SET data-set-1 data-set-2 … data-set-n;
• The number of observations in the new data set equals the sum of the numbers of observations in the old data sets
• The order of observations is determined by the order in which the old data sets are listed
• If one of the data sets has a variable not contained in the other data sets, then observations from the other data sets will have missing values for that variable

Stacking data sets using the SET statement
• Example: The following data sets contain information on visitors to a park.
  There are two entrances: a south entrance and a north entrance. The data file for the south entrance has an S for south, followed by the customers' pass numbers, the sizes of their parties, and their ages. The data file for the north entrance has an N for north, the same data as the south entrance, plus one more variable for the parking lot.

    /* South.dat */
    S 43 3 27
    S 44 3 24
    S 45 3 2

    /* North.dat */
    N 21 5 41 1
    N 87 4 33 3
    N 65 2 67 1
    N 66 2 7 1

    DATA southentrance;
      INPUT Entrance $ PassNumber PartySize Age;
      cards;
    S 43 3 27
    S 44 3 24
    S 45 3 2
    ;
    run;

    DATA northentrance;
      INPUT Entrance $ PassNumber PartySize Age Lot;
      cards;
    N 21 5 41 1
    N 87 4 33 3
    N 65 2 67 1
    N 66 2 7 1
    ;
    run;

    DATA both;
      SET southentrance northentrance;
    RUN;

Combining data sets with one-to-many match
• One-to-many match: matching one observation from one data set with more than one observation in another data set
• The statements for a one-to-many match are the same as for a one-to-one match:
    DATA new-data-set;
      MERGE data-set-1 data-set-2;
      BY variable-list;
• The data sets must first be sorted by the BY variables
• If the two data sets have variables with the same names, other than the BY variables, the variables from the second data set will overwrite the variables with the same name in the first data set

Example: Shoes data
• A shoe store is putting all its shoes on sale. They have two data files: one contains information about each type of shoe, and one contains discount information.
  We want to find the new price of each shoe.

    Shoe data:
    Style              ExerciseType   RegularPrice
    Max Flight         running        142.99
    Zip Fit Leather    walking         83.99
    Zoom Airborne      running        112.99
    Light Step         walking         73.99
    Max Step Woven     walking         75.99
    Zip Sneak          c-train         92.99

    Discount data:
    ExerciseType   Adjustment
    c-train        .25
    running        .30
    walking        .20

    DATA regular;
      INFILE datalines dsd;
      length style $15;
      INPUT Style $ ExerciseType $ RegularPrice @@;
      datalines;
    Max Flight , running, 142.99, …
    ;

    * Sort the shoe data by the BY variable before merging;
    PROC SORT DATA = regular;
      BY ExerciseType;
    run;

    * The discount data are entered already ordered by ExerciseType;
    DATA discount;
      INPUT ExerciseType $ Adjustment @@;
      cards;
    c-train .25 …
    ;

    * One-to-many merge: each discount row matches every shoe of that type;
    DATA prices;
      MERGE regular discount;
      BY ExerciseType;
      NewPrice = ROUND(RegularPrice - (RegularPrice * Adjustment), .01);
    RUN;

Simplifying programs with Arrays
• A SAS array is a collection of elements (usually SAS variables) that lets you write SAS statements referring to the whole group of variables
• Arrays are defined using the ARRAY statement:
    ARRAY name (n) variable-list;
  name: the name you give to the array
  n: the number of variables in the array
• Eg:
    ARRAY store (4) macys sears target costco;
  store(1) is the variable macys
  store(2) is the variable sears

Simplifying programs with Arrays
• A radio station is conducting a survey asking people to rate 10 songs. The rating is on a scale of 1 to 5, with 1 = do not like the song and 5 = like the song
• If the listener does not want to rate a song, he or she enters a "9" to indicate a missing value
• Here are the data, with location, listener's age, and ratings for the 10 songs:
    Albany    54 4 3 5 9 9 2 1 4 4 9
    Richmond  33 5 2 4 3 9 2 9 3 3 3
    Oakland   27 1 3 2 9 9 9 3 4 2 3
    Richmond  41 4 3 5 5 5 2 9 4 5 5
    Berkeley  18 3 4 9 1 4 9 3 9 3 2
• We want to change 9 to missing values (.)
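
For comparison, here is a minimal sketch of the same recoding written without an array, to show how much repetition an array removes. It assumes the ten abbreviated song variable names used in this lecture's radio survey program (domk through ttr). The data set name songs_long and the three inline records are illustrative only and do not come from a real file.

```sas
/* Sketch: recode 9 to missing one variable at a time (no array).   */
/* songs_long and the inline records are illustrative assumptions.  */
DATA songs_long;
   INPUT City $ Age domk wj hwow simbh kt aomm libm tr filp ttr;
   /* One IF statement is needed for each of the ten song variables */
   IF domk  = 9 THEN domk  = .;
   IF wj    = 9 THEN wj    = .;
   IF hwow  = 9 THEN hwow  = .;
   IF simbh = 9 THEN simbh = .;
   IF kt    = 9 THEN kt    = .;
   IF aomm  = 9 THEN aomm  = .;
   IF libm  = 9 THEN libm  = .;
   IF tr    = 9 THEN tr    = .;
   IF filp  = 9 THEN filp  = .;
   IF ttr   = 9 THEN ttr   = .;
   DATALINES;
Albany 54 4 3 5 9 9 2 1 4 4 9
Richmond 33 5 2 4 3 9 2 9 3 3 3
Oakland 27 1 3 2 9 9 9 3 4 2 3
;
RUN;

/* Print the result to confirm that every 9 now shows as a period */
PROC PRINT DATA = songs_long;
RUN;
```

With 10 variables this is only tedious; with 100 it becomes error-prone, which is the case for replacing the ten statements with one loop over an array.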
Simplifying programs with Arrays
    DATA songs;
      INFILE 'E:\radio.txt';
      INPUT City $ 1-15 Age domk wj hwow simbh kt aomm libm tr filp ttr;
      * Group the ten song variables so they can be processed in a loop;
      ARRAY song (10) domk wj hwow simbh kt aomm libm tr filp ttr;
      DO i = 1 TO 10;
        IF song(i) = 9 THEN song(i) = .;
      END;
    run;

Using shortcuts for lists of variable names
• When writing SAS programs, we often need to write a list of variable names. When you have data with many variables, a shortcut for lists of variable names is helpful
• Numbered range list: variables whose names start with the same characters and end with consecutive numbers can be part of a numbered range list
• Eg: the following two statements are equivalent
    INPUT cat8 cat9 cat10 cat11;
    INPUT cat8-cat11;

Using shortcuts for lists of variable names
• Name range list: a name range list depends on the internal order, or position, of the variables in a SAS data set. This is determined by the order in which the variables appear in the DATA step.
• Eg:
    Data new;
      Input x1 x2 y2 y3;
    Run;
• The internal order is then: x1 x2 y2 y3
• The shortcut for this variable list is x1--y3 (two hyphens; a single hyphen denotes a numbered range list)
• PROC CONTENTS with the POSITION option can be used to find the internal order

Using shortcuts for lists of variable names
    DATA songs;
      INFILE 'E:\radio.txt';
      INPUT City $ 1-15 Age domk wj hwow simbh kt aomm libm tr filp ttr;
      ARRAY new (10) Song1 - Song10;   * numbered range list;
      ARRAY old (10) domk -- ttr;      * name range list;
      DO i = 1 TO 10;
        IF old(i) = 9 THEN new(i) = .;
        ELSE new(i) = old(i);
      END;
      AvgScore = MEAN(OF Song1 - Song10);
    run;

Sorting, Printing and Summarizing Your Data
• SAS procedures (PROCs) perform a specific analysis or function and produce results or reports
• Eg:
    Proc Print data = new;
    run;
• All procedures have required statements, and most have optional statements
• All procedures start with the keyword PROC, followed by the name of the procedure, such as PRINT or CONTENTS
• Options, if there are any, follow the procedure name
• The DATA=data_name option tells SAS which data set to use as input for the procedure.
  NOTE: if you omit it, SAS uses the most recently created data set, which is not necessarily the same as the most recently used data set.

BY statement
• The BY statement is required for only one procedure, PROC SORT:
    PROC Sort data = new;
      By gender;
    Run;
• For all other procedures, BY is an optional statement. It tells SAS to perform the analysis separately for each level of the variables listed after BY, instead of treating all subjects as one group:
    Proc Print data = new;
      By gender;
    Run;
• All procedures except PROC SORT assume your data are already sorted by the variables in the BY statement

PROC Sort
• Syntax:
    Proc Sort data = input_data_name out = out_data_name;
      By variable-1 … variable-n;
• The variables in the BY statement are called BY variables
• With one BY variable, SAS sorts the data based on the values of that variable
• With more than one BY variable, SAS sorts observations by the first variable, then by the second variable within the categories of the first variable, and so on
• The DATA= and OUT= options specify the input and output data sets. Without the DATA= option, SAS uses the most recently created data set. Without the OUT= option, SAS replaces the original data set with the newly sorted version

PROC Sort
• By default, SAS sorts data in ascending order, from the lowest to the highest value or from A to Z. To reverse the order (highest to lowest, or Z to A), add the keyword DESCENDING before each variable that should be sorted that way
• The NODUPKEY option tells SAS to eliminate any duplicate observations that have the same values for the BY variables

PROC Sort
• Example: The file sealife.txt contains information on the average length in feet of selected whales and sharks. We want to sort the data by family and length
    Name      Family   Length
    beluga    whale    15
    whale     shark    40
    basking   shark    30
    gray      whale    50
    mako      shark    12
    sperm     whale    60
    dwarf     shark    .5
    whale     shark    40
    humpback  .        50
    blue      whale    100
    killer    whale    30

PROC Sort
    DATA marine;
      INFILE 'E:\Sealife.txt';
      INPUT Name $ Family $ Length;
    run;

    * Sort the data and drop observations that repeat a Family/Length combination;
    PROC SORT DATA = marine OUT = seasort NODUPKEY;
      BY Family DESCENDING Length;
    run;

Summarizing your data with PROC MEANS
• PROC MEANS provides simple statistics on numeric variables.
  Syntax:
    Proc means options;
• Simple statistics that PROC MEANS can produce:
    MAX: the maximum value
    MIN: the minimum value
    MEAN: the mean
    N: the number of non-missing values
    STDDEV: the standard deviation
    NMISS: the number of missing values
    RANGE: the range of the data
    SUM: the sum
    MEDIAN: the median
• By default, PROC MEANS reports N, MEAN, STDDEV, MIN and MAX

Proc means
• Statements used with PROC MEANS:
    BY variable-list: performs the analysis for each level of the variables in the list. The data must be sorted first
    CLASS variable-list: performs the analysis for each level of the variables in the list. The data do not need to be sorted
    VAR variable-list: specifies which variables to use in the analysis

Proc means
• A wholesale nursery sells garden flowers and wants to summarize its sales figures by month. The data are as follows:
    ID      Date        Lily  SnapDragon  Marigold
    756-01  05/04/2001  120    80         110
    756-01  05/14/2001  130    90         120
    834-01  05/12/2001   90   160          60
    834-01  05/14/2001   80    60          70
    901-02  05/18/2001   50   100          75
    834-01  06/01/2001   80    60         100
    756-01  06/11/2001  100   160          75
    901-02  06/19/2001   60    60          60
    756-01  06/25/2001   85   110         100

    DATA sales;
      INFILE 'E:\Flowers.txt';
      INPUT CustomerID $ @9 SaleDate MMDDYY10.
            Lily SnapDragon Marigold;
      Month = MONTH(SaleDate);
    run;

    PROC SORT DATA = sales;
      BY Month;
    run;

    * Calculate means by Month for flower sales;
    PROC MEANS DATA = sales;
      BY Month;
      VAR Lily SnapDragon Marigold;
      TITLE 'Summary of Flower Sales by Month';
    RUN;

Proc GCHART for bar charts
• Example: a bar chart showing the distribution of blood types from the Blood data set

    /* The blood.txt data contain information on 1000 subjects.
       The variables include: subject ID, gender, blood type, age group,
       red blood cell count, white blood cell count, and cholesterol. */
    DATA blood;
      INFILE 'C:\blood.txt';
      INPUT ID Sex $ BloodType $ AgeGroup $ RBC WBC Cholesterol;
    run;

    title "Distribution of Blood Types";
    proc gchart data=blood;
      vbar BloodType;
    run;

Proc GCHART for bar charts
• VBAR: requests a vertical bar chart for the variable
• Alternatives to VBAR are as follows:
    HBAR: horizontal bar chart
    VBAR3D: three-dimensional vertical bar chart
    HBAR3D: three-dimensional horizontal bar chart
    PIE: pie chart
    PIE3D: three-dimensional pie chart
    DONUT: donut chart

A Few Options
    proc gchart data=blood;
      vbar bloodtype / space=0 type=percent;
    run;
• SPACE=0 controls the spacing between bars
• TYPE=PERCENT changes the statistic from frequency to percent

Type option
• Type=freq: displays frequencies of a categorical variable
• Type=pct (percent): displays percents of a categorical variable
• Type=cfreq: displays cumulative frequencies of a categorical variable
• Type=cpct (cpercent): displays cumulative percents of a categorical variable

Basic Output
• In the chart of a continuous variable (figure not reproduced here), the bar labeled with the midpoint 7,000 corresponds to a class ranging from 6500 to 7500, with a frequency of about 350
• SAS computes the midpoints of each bar automatically.
  You can change them by supplying your own midpoints:
    vbar RBC / midpoints=4000 to 11000 by 1000;

Creating charts with values representing categories
• SAS places the values of continuous variables into groups before generating a frequency bar chart
• If you want to treat the values as discrete categories, use the DISCRETE option
• Example: create a bar chart showing the frequencies of hospital visits by day of the week

    libname d "C:\";
    data day_of_week;
      set d.hosp;
      Day = weekday(AdmitDate);
    run;

    *Program demonstrating the DISCRETE option of PROC GCHART;
    title "Visits by Day of the Week";
    proc gchart data=day_of_week;
      vbar Day / discrete;
    run;

The Discrete Option
    proc gchart data=day_of_week;
      vbar day / discrete;
    run;
• If you use DISCRETE with a numeric variable, you should either:
  1. Be sure it has only a few distinct values, or
  2. Use a format to create categories for it
• DISCRETE establishes each distinct value of the chart variable as a midpoint on the graph. If the variable is formatted, the formatted values are used to construct the midpoints.

Summary Variables
• To make the bar chart summarize the values of an analysis variable for each midpoint, use the SUMVAR= option (together with TYPE=).
• sumvar=variable-name: names the analysis variable to summarize
• Type=mean: displays the mean of a continuous variable
• Type=sum: displays the total of a continuous variable (this is the default when SUMVAR= is used)

Creating bar charts representing sums
• The GCHART procedure can be used to create bar charts where the height of each bar represents a statistic, such as the mean or the sum, for each value of a classification variable
• Example: bar chart showing the sum of TotalSales for each region of the country
    title "Total Sales by Region";
    proc gchart data=d.sales;
      vbar Region / sumvar=TotalSales type=sum;
      format TotalSales dollar8.;
    run;

Creating bar charts representing means
    * The gender variable is named Sex in the blood data set created earlier in this lecture;
    proc gchart data=blood;
      vbar Sex / sumvar=cholesterol type=mean;
    run;
    quit;

GPLOT
• The GPLOT procedure plots the values of two or more variables on a set of coordinate axes (X and Y)
• The procedure produces a variety of two-dimensional graphs, including
  – simple scatter plots
  – overlay plots, in which multiple sets of data points are displayed on one set of axes

Procedure Syntax: PROC GPLOT
• Syntax:
    PROC GPLOT;
      PLOT y*x </option(s)>;
    run;
• Example: plot of systolic blood pressure (SBP) by diastolic blood pressure (DBP)
    title "Scatter Plot of SBP by DBP";
    proc gplot data=d.clinic;
      plot SBP * DBP;
    run;

• Multiple plots can be made in 3 ways:
  (1) proc gplot;
        plot y1*x y2*x / overlay;
      run;
      plots y1 versus x and y2 versus x using the same horizontal and vertical axes.
  (2) proc gplot;
        plot y1*x;
        plot2 y2*x;
      run;
      plots y1 versus x and y2 versus x using different vertical axes. The second vertical axis appears on the right-hand side of the graph.
  (3) proc gplot;
        plot y1*x=z;
      run;
      uses z as a classification variable and produces a single graph that plots y1 against x with a separate set of points for each value of z.

    *controlling the axis ranges;
    title "Scatter Plot of SBP by DBP";
    proc gplot data=d.clinic;
      plot SBP * DBP / haxis=70 to 120 by 5
                       vaxis=100 to 220 by 10;
    run;
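
The classification-variable form of PLOT and the axis options can be combined in one statement. The following is a minimal sketch rather than part of the lecture's own examples: it assumes that SAS/GRAPH is available and uses SASHELP.CLASS, a small sample data set shipped with SAS, in place of the d.clinic data. The axis ranges were chosen to cover the values in that sample data set.

```sas
/* Sketch: one graph of Weight against Height, with a separate set  */
/* of points for each value of Sex, and explicit axis ranges.       */
/* SASHELP.CLASS stands in for the lecture's own data sets.         */
title "Weight by Height, Grouped by Sex";
proc gplot data=sashelp.class;
   plot Weight * Height = Sex / haxis=50 to 75 by 5
                                vaxis=50 to 150 by 25;
run;
quit;

/* Clear the title so it does not carry over to later output */
title;
```

The QUIT statement ends the procedure. GPLOT, like GCHART, otherwise stays active after RUN and waits for further PLOT statements.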