LISA SHORT COURSE SERIES: INTRODUCTION TO SAS UNIVERSITY
William DeShong
Fall 2015

Upcoming LISA Short Courses
http://www.lisa.stat.vt.edu/?q=short_courses

Outline
1. SAS Overview
2. SAS University Environment
3. Data Step
   1. Importing Data Sets
   2. Merging Data Sets
4. Procedure Step
   1. Manipulate/View Data
      1. Proc Print
      2. Proc Sort
   2. Aggregate Data
      1. Proc Summary
      2. Proc Freq
      3. Proc Means
   3. Model Data
      1. Proc Reg (If time permits)

SAS
• SAS (originally an acronym for Statistical Analysis System) is a data-driven programming language for turning data into information.
• The functionality of SAS is built around four data-driven tasks:
  • Data Access: addresses or locates the data required by the programmer.
  • Data Analysis: summarizes, reduces, or transforms raw data into meaningful and useful information.
  • Data Management: shapes the data into a form required by the programmer.
  • Data Presentation: communicates information in ways that clearly demonstrate its significance.

SAS Program
• A SAS program (also called "SAS code") is a series of steps for SAS to execute, and each step is made up of statements. There are three building blocks:
  • DATA steps
  • PROC steps
  • global statements
• Every DATA step ends with a RUN statement.
• Every PROC step ends with either:
  • a RUN statement (for almost all procedures)
  • a QUIT statement (for a very few procedures)

Flow of Programming
[Diagram with the boxes: Raw Dataset, DATA Statement, Built-In SAS Dataset(s), SAS Dataset, PROC Statement, Report]
• A DATA step can be used to (1) create a SAS dataset from scratch, (2) create a SAS dataset from a raw dataset, (3) check for and correct errors in a dataset, and (4) create a SAS dataset by merging, subsetting, and updating existing SAS datasets.

SAS Pointers
• When programming in SAS, keep the following pointers in mind to prevent syntax errors:
  • Semicolon Check: Every SAS statement ends with a semicolon ( ; ). A statement may span several lines (for example, a FORMAT or LABEL statement that lists many variables), in which case only its last line carries the semicolon. One missing semicolon can break an entire SAS program.
  • Use Comments: You can make a one-line comment by placing an asterisk ( * ) in front of the comment and a semicolon ( ; ) at its end. For a multi-line comment, start with ( /* ) on the first line and end with ( */ ) on the last line. SAS ignores commented text when it runs the program. Comments help the programmer remember what each part of the SAS code does.

SAS University Edition Environment
• Let's take a look at SAS University now!

Data Step

Importing Datasets
• Let's use the Data Importing Wizard!

Accessing Permanent SAS Datasets
• To access existing SAS datasets, use the following code:

    libname name_of_library 'location_of_file' ;

• LIBNAME is a global statement, so it takes effect immediately and does not need a RUN statement after it.
• The name_of_library is a name that you choose to represent the folder in which to store new SAS datasets or from which to access existing ones.
• The location_of_file represents the location where SAS should go to find or save permanent SAS datasets.
• Note that in giving the location, you do not mention which particular SAS dataset you want to use.
• Rather, you give the folder or path (if there is no folder) where the SAS dataset(s) are located.
• Most SAS programmers put all of their SAS datasets in one folder so that they can access them all at one time.
• The name_of_library must be 1 to 8 characters long, must begin with a letter or underscore, and may contain only letters, numbers, or underscores.

Legal vs. Illegal Names of Libraries
• How many of the following seven library names are legal library names?
  • clinic1
  • 1_clinic
  • _%clinic
  • _clinic1
  • _1clinic
  • clinic_1
  • 1clinic_1
• Answer: 4 (clinic1, _clinic1, _1clinic, and clinic_1). The other three either begin with a digit or contain a character that is not a letter, number, or underscore.

Descriptive Statistics Functions
• Below are a few of the descriptive statistics functions. Most of these descriptive statistics can also be found using PROC MEANS or PROC UNIVARIATE.
    Function   Syntax                             Calculates
    SUM        sum(argument, argument, …) ;       sum of values
    MEAN       mean(argument, argument, …) ;      average of nonmissing values
    MIN        min(argument, argument, …) ;       minimum value
    MAX        max(argument, argument, …) ;       maximum value
    VAR        var(argument, argument, …) ;       variance of the values
    STD        std(argument, argument, …) ;       standard deviation

Date and Time Functions

    Function   Syntax                                  Calculates
    TODAY      today( ) ;                              today's SAS date value (requires no arguments)
    TIME       time( ) ;                               the current time (requires no arguments)
    MDY        mdy(month_val, day_val, year_val) ;     the numeric SAS date value
    DAY        day(date_val) ;                         the day of the month of the SAS date value (1-31)
    QTR        qtr(date_val) ;                         the quarter of the year of the SAS date value (1-4)
    WEEKDAY    weekday(date_val) ;                     the numeric day of the week of the SAS date value (1-7)
    MONTH      month(date_val) ;                       the month of the SAS date value (1-12)
    YEAR       year(date_val) ;                        the year of the SAS date value (4 digits)

Date and Time Functions
• Here are some interesting ones, however.

  INTCK
    Syntax:
        intck('day' , SASdate1 , SASdate2) ;
        intck('week' , SASdate1 , SASdate2) ;
        intck('month' , SASdate1 , SASdate2) ;
        intck('qtr' , SASdate1 , SASdate2) ;
        intck('year' , SASdate1 , SASdate2) ;
    Calculates: the difference in the number of {days, weeks, months, quarters, years} between two SAS date values, counted as the number of interval boundaries crossed.

  INTNX
    Syntax:
        intnx('interval' , SAS_start_date , increment , alignment_character) ;
    Calculates: a SAS_end_date, which is SAS_start_date advanced by a multiple (increment) of the time interval.
    alignment_characters (shown here for the 'month' interval):
      • 'b' = beginning of the interval (1st of the month)
      • 'm' = middle of the interval (about the 15th of the month)
      • 'e' = end of the interval (last day of the month)
      • 's' = same day as SAS_start_date

Mathematical Functions
• Below are a few of the many mathematical functions. There are too many to list them all. You learn them as you learn how to program.
    Function   Syntax                    Calculates
    ROUND      round(argument, d) ;      rounds to the nearest d, where
                                           • d = 10 (tens)
                                           • d = 1 (integer)
                                           • d = .1 (tenths)
                                           • d = .01 (hundredths)
    LOG        log(argument) ;           takes the natural log
    LOG10      log10(argument) ;         takes the log base 10
    FLOOR      floor(argument) ;         rounds down to the nearest integer
    CEIL       ceil(argument) ;          rounds up to the nearest integer
    INT        int(argument) ;           returns the integer part of the value only

Character Functions

    Function   Syntax                                  Calculates
    SCAN       scan(argument, n, delimiters) ;         returns the n-th word from a character value
    SUBSTR     substr(argument, position, length) ;    extracts a substring; on the left side of an assignment, replaces character values
    TRIM       trim(argument) ;                        trims trailing blanks
    INDEX      index(source, excerpt) ;                searches a character value for a specific string and returns its position (0 if not found)
    UPCASE     upcase(argument) ;                      converts to uppercase letters
    LOWCASE    lowcase(argument) ;                     converts to lowercase letters
    PROPCASE   propcase(argument) ;                    capitalizes the first letter of each word
    TRANWRD    tranwrd(source, target, replace) ;      replaces or removes all occurrences of a pattern of characters

PROC SORT Statement
• The purpose of PROC SORT is to reorder the observations of a SAS dataset by a subset of its variables.

    proc sort data = libref.datasetname ;
        by var1 var2 … vark ;
    run ;

• The PROC SORT statement can:
  • sort by one variable or more than one variable
  • sort in ascending order or descending order
  • remove duplicates while sorting (not by default; you must specify it)

PROC SORT Statement

    proc sort data = libref2.dataset1 out = libref2.dataset2 ;
        by var1 var2 … vark ;
    run ;

• If you specify the out= option, SAS will write the sorted copy of the original SAS dataset (dataset1) to the SAS dataset named in out= (dataset2), and dataset1 is left unchanged.
• If you do not use the out= option, SAS will sort dataset1 and store the result back into dataset1.
• Thus, it overwrites the dataset and you lose the original order.
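The PROC SORT options described above can be sketched on a tiny dataset. This is only an illustration: the dataset names (work.scores, work.scores_sorted, work.scores_unique), the variable names, and the values are all made up, and the WORK library is used so that no permanent dataset is touched.

```sas
/* Hypothetical example data: names and values are invented for illustration */
data work.scores ;
    input idnum score ;
    datalines ;
2054 7
1226 9
2054 8
1226 4
;
run ;

/* Sort by idnum (ascending), then by score (descending).
   out= writes the sorted copy to a new dataset,
   so work.scores keeps its original order. */
proc sort data = work.scores out = work.scores_sorted ;
    by idnum descending score ;
run ;

/* nodupkey keeps only one observation for each value of idnum */
proc sort data = work.scores out = work.scores_unique nodupkey ;
    by idnum ;
run ;

/* View the sorted copy */
proc print data = work.scores_sorted noobs ;
run ;
```

Note that the descending keyword goes in front of the variable it applies to, and that leaving out the out= option in either step would overwrite work.scores with the sorted result.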
Merging Data Sets with Match-Merging
• With simple match-merging, the SAS programmer links observations together using the values of the variables listed in the BY statement.

    proc sort data = SAS-dataset-1 ;
        by <descending> variable_1 variable_2 … variable_n ;
        SAS statements ;
    run ;
        …
    proc sort data = SAS-dataset-k ;
        by <descending> variable_1 variable_2 … variable_n ;
        SAS statements ;
    run ;

    data newSASdataset ;
        merge SAS-dataset-1 SAS-dataset-2 … SAS-dataset-k ;
        by <descending> variable_1 variable_2 … variable_n ;
        SAS statements ;
    run ;

Match-Merging
• To perform this technique, all of the original SAS datasets being merged must first be sorted by the variables in the BY statement.

Example #1
• Flight attendants for International Airlines need to pass three exams (federal regulations, customer service, and safety procedures) in order to become certified flight attendants. They can take them at any time, but they must pass the federal regulations exam before moving on. Below are three permanent SAS datasets showing the attempts by id number and their scores. A score higher than 6 is needed to pass.

Match-Merging [Step 1: Use a PROC SORT]
• The PROC SORT steps will sort the three SAS datasets by the idnum variable. This sets us up to begin the simple match-merging procedure.

Match-Merging [Step 2: The DATA Statement]
• The DATA step will link the observations together by the idnum variable.
• But how does SAS accomplish this?
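Steps 1 and 2 can be sketched in code as follows. This is a minimal sketch, not the course's actual program: the dataset names (work.fed_regs, work.cust_service, work.safety, work.all_exams), the variables cs_score and sp_score, and all of the values are hypothetical stand-ins for the three exam datasets. Only idnum and fr_score are names taken from the slides.

```sas
/* Hypothetical miniature versions of the three exam datasets */
data work.fed_regs ;
    input idnum fr_score ;
    datalines ;
2054 7
1226 9
2054 8
;
run ;

data work.cust_service ;
    input idnum cs_score ;
    datalines ;
1226 8
2054 6
2054 9
1101 5
;
run ;

data work.safety ;
    input idnum sp_score ;
    datalines ;
2054 7
1226 9
2054 9
1101 4
;
run ;

/* Step 1: sort each dataset by the BY variable */
proc sort data = work.fed_regs ;     by idnum ; run ;
proc sort data = work.cust_service ; by idnum ; run ;
proc sort data = work.safety ;       by idnum ; run ;

/* Step 2: match-merge the three sorted datasets on idnum */
data work.all_exams ;
    merge work.fed_regs work.cust_service work.safety ;
    by idnum ;
run ;

/* View the merged result */
proc print data = work.all_exams noobs ;
run ;
```

In this made-up data, idnum 1101 appears only in the last two datasets, so its fr_score is missing in the merged dataset, while idnum 2054 appears twice in all three datasets and is linked in order of appearance.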
Match-Merging [Step 3: The Merging]
• From all three SAS datasets, SAS searches for the first set of observations with the lowest value for idnum. In this case, it is the missing value in the third dataset. Why? (Hint: when sorting, SAS treats a missing numeric value as lower than any number.)
• Notice, however, that there are no observations in the other SAS datasets with a missing idnum. If an input SAS dataset does not have a matching BY value, then the observation in the output SAS dataset contains missing values for the variables that are unique to that input dataset.

Match-Merging [Step 3: The Merging]
• SAS now searches for the next lowest value of the idnum variable.
• Here, the value appears in only two of the three SAS datasets.
• Again, SAS will put missing values in for the fr_score variable.

Match-Merging [Step 3: The Merging]
• The next idnum value is 1226. Fortunately, it appears exactly once in all three datasets.
• SAS simply links them together.
• So when a BY variable value appears the same number of times in all of the SAS datasets, SAS has no problem linking them together in order.

Match-Merging [Step 3: The Merging]
• SAS does the same for the next idnum value, 2054.
• Since there is an equal number of observations in all three SAS datasets, SAS links them together by the order in which they appear.
• The first observations in each dataset link together, and the second observations link together.

Match-Merging [Step 3: The Merging]
• Now look at this situation. Not only are we missing an observation in the third SAS dataset, but there is an unequal number of observations in the first two.
• SAS can only link observations one-to-one when each SAS dataset has the same number of observations for a given BY variable value.

Match-Merging [Step 3: The Merging]
• Behind the scenes, SAS in effect replicates the last observation of the shorter BY group until the counts match, and then links the observations by the order in which they appear.
• A SAS dataset with no observation for the BY variable value is treated the same way: it contributes missing values to every linked observation.
• Please note that the replicated observations do not appear in the input SAS datasets.

    First dataset    Second dataset   Third dataset
    3362  8          3362  4          3362  .
    3362  8          3362  8          3362  .

Match-Merging [Step 3: The Merging]
• This is similar to the last example (but with more observations).
• Questions
  • How many observations will SAS create for idnum 4524?
  • What observations will be replicated to perform the match-merging?
  • How will SAS link these records together?

    First dataset    Second dataset   Third dataset
    4524  4          4524  2          4524  4
    4524  5          4524  3          4524  7
    4524  8          4524  4          4524  7
    4524  8          4524  5          4524  7
    4524  8          4524  7          4524  7

Match-Merging [Step 3: The Merging]
• I think you get the point now, right?
• And so you should know what appears next in the SAS dataset.
• There will be two observations for idnum 5702…

    First dataset    Second dataset   Third dataset
    5702  9          5702  9          5702  5
    5702  9          5702  9          5702  9

Match-Merging [Step 3: The Merging]
• There will be three observations for idnum 6256…

    First dataset    Second dataset   Third dataset
    6256  1          6256  8          6256  9
    6256  5          6256  8          6256  9
    6256  9          6256  8          6256  9

Match-Merging [Step 3: The Merging]
• There will be one observation for idnum 7803…

    First dataset    Second dataset   Third dataset
    7803  9          7803  8          7803  8

Match-Merging [Step 3: The Merging]
• There will be two observations for idnum 8008…

    First dataset    Second dataset   Third dataset
    8008  9          8008  7          8008  5
    8008  9          8008  7          8008  9

Match-Merging [Step 3: The Merging]
• And finally, there will be four observations for idnum 9890.

    First dataset    Second dataset   Third dataset
    9890  3          9890  2          9890  9
    9890  8          9890  4          9890  9
    9890  8          9890  5          9890  9
    9890  8          9890  9          9890  9

Voila! … the SAS dataset is complete.

Common Variable (Simple Match-Merging)
• Keep in mind that all four common variable rules apply to the simple match-merging process.
  • The common variable must have the same variable type (i.e. numeric or character) in each of the original SAS datasets. Otherwise, SAS will return an error message.
  • The values from the last original SAS dataset overwrite the previous values stored for that variable.
  • If a common variable has different formats, SAS will use the first format it sees for that variable.
  • If a common variable has different lengths, SAS will use the first length it sees for that variable.
It is the second of these rules (values from the last dataset overwrite earlier values) that we are going to investigate more right now. The last thing that we want to do is overwrite data.

The PROC PRINT Statement
• PROC PRINT is one of the most commonly used procedures in SAS. It lets you display a SAS dataset (or a subset of it) in the output window.
• The most basic format of the PROC PRINT statement is the following:

    proc print data = libref.datasetname ;
    run ;

• In this format, SAS will print all of the variables in the SAS dataset to the output window, unformatted. Of course, there are ways to enhance the output (some of which we will cover now).

PROC PRINT: Options
• If you want SAS to print specific variables, you can adjust the code by including a var statement.

    proc print data = libref.datasetname ;
        var variable1 variable2 … variablek ;
    run ;

• You can also produce column totals for numeric variables by using a sum statement.

    proc print data = libref.datasetname ;
        sum num_variable ;
    run ;

PROC PRINT: Options (cont.)
• You can also suppress the observation number by including the noobs option on the PROC PRINT statement.

    proc print data = libref.datasetname noobs ;
    run ;

• Alternatively, if you have a variable that identifies each observation, you can use the id statement to replace the default observation number.

    proc print data = libref.datasetname ;
        id variable1 ;
    run ;

PROC PRINT: Options (cont.)
• Rather than print the variable name, you can substitute a label for the variable by including a label statement. But notice that you must also add the label option to the PROC PRINT statement itself.

    proc print data = libref.datasetname label ;
        label variable1 = 'Variable 1' ;
    run ;

• You can also print a subset of the observations in the SAS dataset based on a condition or a set of conditions by using a where statement in the code.
    proc print data = libref.datasetname ;
        where insert_condition_here ;
    run ;

PROC CONTENTS Statement
• The purpose of PROC CONTENTS is to provide a detailed listing of:
  • the variables in a SAS dataset

    proc contents data = libref.datasetname ;
    run ;

  • the SAS datasets located in a SAS library

    proc contents data = libref._all_ ;
    run ;

• The '_all_' is a SAS keyword that references all of the SAS datasets in a SAS library.

PROC FREQ Statement
• Now, we turn our attention to procedures that produce results in the output window.
• The purpose of PROC FREQ is to create a frequency or relative frequency table for a subset of SAS variables. The code to do this is the following:

    proc freq data = libref.datasetname ;
        tables var1 var2 … vark ;
    run ;

• PROC FREQ can not only create a table for one or more variables, but it can also save the results as a SAS dataset.

PROC MEANS
• Let's start from the basics. The basic form of PROC MEANS is the following:

    proc means data = libref.datasetname ;
    run ;

• This basic form:
  • produces statistical output for all of the numeric variables in the SAS dataset
  • produces the sample size, mean, standard deviation, minimum, and maximum values by default
• We will use our baseball SAS dataset to understand how this procedure works.

Scenario
• Here is a SAS dataset called baseball. It is located in the 'ia' library.

Scenario
• Here is the breakdown of the variables.

PROC MEANS
• Here is the application of PROC MEANS without any options:

    proc means data = ia.baseball ;
    run ;

• Again, without any options, SAS calculates the sample size, sample mean, sample standard deviation, minimum, and maximum values for each numeric variable in the SAS dataset.
• The output is placed in a table and is shown in the output window (i.e. the MEANS procedure does not create a new SAS dataset unless you ask for one).
• Let's adjust the code to get better output.
PROC MEANS [var keyword]
• Notice that on the last slide, all of the numeric variables appeared in the output. To limit the output to specific variables in the SAS dataset, include a var statement followed by only the variables that you want.

    proc means data = libref.datasetname ;
        var variable_1 variable_2 variable_3 … variable_k ;
    run ;

• SAS will only output the statistics for the variables that you list (and in that order).
• Note: if you have SAS variables whose names differ only by a number at the end of the variable name (for example: exam1 exam2 exam3 exam4 exam5), you can reference all of them with the following:

    var variable_1 - variable_k ;

• For our example, we can say:

    var exam1 - exam5 ;

PROC MEANS [ <stat-keywords> ]
• You can specify which descriptive statistics you want to output by listing them after the name of the dataset.

    proc means data = libref.datasetname <stat-keywords> ;
        var variable_1 variable_2 variable_3 … variable_k ;
    run ;

• By using this option, you override the default statistics. SAS will then produce only the statistics that you specify.
• There are dozens of statistical keywords to choose from (for example: n, mean, median, std, min, max).

PROC REG
• The basic form of PROC REG is the following:

    proc reg data = libref.datasetname ;
        id id_variable ;
        model responsevar = var1 var2 … vark ;
    run ;

• This basic form:
  • fits a linear regression model and reports the model fit and parameter estimates
  • produces residual diagnostics
• We will use our Salary of Major League Baseball Players SAS dataset to understand how this procedure works.

Questions?

Special Thanks
• Dr. Chris Franck, Assistant Director of LISA
• Tonya Pruitt, Administrative Specialist, LISA
• Dr. Marlow Lemons
• Kris Patton
• Elaine Perrin
• Weibin Xu