MORE DETAILS ON SAS

In what follows, SAS keywords are typed in capital letters. You will not have to type them in capitals, but you must spell them exactly as SAS does. Variable names and dataset names are typed in lowercase bold letters. I have used generic names here like x or y. You would replace these generic names with the names you have decided on in your program. These names must start with a letter or underscore _, and must contain only letters, numbers or underscores. Of course, you will not be typing them in bold. SAS has some special variable names that it uses, which begin and end with an underscore, so it is safer not to start your variable names with an underscore. Optional information is typed in lowercase italics. You do not have to include this information unless you want to customize SAS to perform the analysis in the way you wish.

THE DATA PARAGRAPH

Data paragraphs create data sets, manipulate them to create new variables or to select subsets of the data, and can output permanent data sets in SAS format. Data paragraphs always begin with the DATA statement.

    DATA mydata;

instructs SAS to begin creating a working file named mydata. You do not have to give names to working files, but it helps in complicated programs when many datasets might be accessed.

To assign names to the variables that are being read in, you need an INPUT statement.

    INPUT a1 $ a2 $ x1 x2;

would describe a situation where there are four variables recorded on each line of the data set. The first two variables, a1 and a2, are character variables with non-numeric entries, so their variable name is followed by a $. The last two variables, x1 and x2, are numeric. Whenever you wish to refer to these variables later in the program, you must spell the names exactly as they were spelled in the INPUT statement.

In this simplest form of the INPUT statement, there is no formatting information to tell SAS where on a line to find the variables.
Hence, SAS will default to expecting the variables to each be separated by one or more spaces.

After the INPUT statement, you can add lines which tell SAS how to create new variables from the original variables listed in the INPUT statement. For example, suppose I wanted to create a new variable y = x1/x2. After the INPUT statement, I could put

    y = x1/x2;

In general, these kinds of new variable lines have the form

    Newvariablename = mathematical expression in terms of existing variables;

Your mathematical expression can use the usual arithmetic operators:

    +    addition
    -    subtraction
    *    multiplication
    **   power (as in 10**y, which is 10 raised to the power y)
    /    division

There are also a number of built-in functions that you might find useful, such as

    Y=SQRT(x1);     which computes the square root of x1.
    Y=LOG(x1);      which computes the natural logarithm of x1.
    Y=LOG10(x1);    which computes the logarithm base 10 of x1.
    Y=ABS(x1);      which returns the absolute value of x1.

There are many dozens of functions listed in SAS Help.

You can use IF statements to perform different actions depending on some conditions. For example, I might want to create a new variable called big which is 0 when x1 is less than 100 and 1 when x1 is greater than or equal to 100. Then I could use two lines:

    big = 0;
    IF (x1 ge 100) THEN big=1;

SAS uses the abbreviations = or EQ for ‘equal’, ~= or NE for ‘not equal’, < or LT for ‘less than’, > or GT for ‘greater than’, <= or LE for ‘less than or equal to’, and >= or GE for ‘greater than or equal to’.

Sometimes you only want to use part of a dataset. For example, if I were analyzing data on second and third graders, I might just want to use the third graders for my analysis. You can adapt the IF statement for this. The statement

    IF (mathematical expression);

will keep any data line where the mathematical expression is true, and discard those where it is false.
For example, if I only wanted to keep datalines where x1 is greater than 50, I could write

    IF (x1 gt 50);

The actual data comes last in the paragraph, following a CARDS statement. Some people say DATALINES instead of CARDS. End the data with a semicolon, then a RUN; statement to actually execute the paragraph.

EXAMPLE

    *this program reads in the student names and first three quiz scores;
    *it checks to see if the lowest quiz score is less than 5;
    *if the lowest quiz score is less than 5, it creates a warning flag;
    DATA grades;
    INPUT student $ quiz1 quiz2 quiz3;
    worst=MIN(quiz1,quiz2,quiz3);
    warning=0;
    IF (worst lt 5) THEN warning=1;
    CARDS;
    Sam 8 10 9
    Joe 10 4 8
    Abby 8 9 7
    ;
    RUN;

    DATA trouble;
    SET grades;
    IF (warning = 1);
    RUN;

    PROC PRINT DATA=trouble;
    RUN;

The first DATA paragraph creates the original data set grades, uses the MIN function to find the minimum of the quiz scores, and creates a variable called warning that is 1 when the worst quiz score is too low.

The second DATA paragraph creates a new data set named trouble. To begin, it uses the SET statement to import the original dataset grades. Since grades was created in SAS, SAS remembers all the variable names. There is no need for another INPUT statement. The dataset trouble will only have one line in it: the line for the student (Joe) who has warning=1.

Finally, PROC PRINT prints out the information in the dataset trouble, so I can see which students have a low quiz score.

SOME USEFUL PROCS

    PROC PRINT data=datasetname;

Prints out the information in the dataset specified in the optional data=datasetname field. If you leave out data=datasetname, then it will print the contents of the most recently created dataset.

VAR -- If you do not want to print out every variable in the dataset, list the ones you do want to print in the VAR statement.
Example:

    PROC PRINT data=grades;
    VAR student worst quiz1 quiz2 quiz3;
    RUN;

    PROC SORT data=datasetname;
    BY variablename;

Sorts the lines of the dataset in ascending order of the variable specified in the BY statement. If variablename is a character variable, the order is alphabetic.

Example:

    PROC SORT data=grades;
    BY student;
    RUN;

    PROC PRINT data=grades;
    VAR student worst;
    RUN;

would give output like

    student    worst
    Abby         7
    Joe          4
    Sam          8

    PROC MEANS data=datasetname;
    VAR variablenames (separated by spaces);
    BY variablenames;

Produces simple means, standard deviations, sample sizes, minimums and maximums for the variables listed in the VAR statement. If there is more than one variable, separate the names by spaces. The BY statement is optional; use it if you want the means, etc., computed separately for each value of the variable named in the BY statement. To use a BY statement, you must first have sorted the data (with PROC SORT) by that same variable.

Example: Suppose your data set had vocabulary scores for 2nd graders and 3rd graders. You want to get the means and standard deviations separately for the 2nd and 3rd graders.

    DATA vocabulary;
    INPUT student $ grade score;
    CARDS;
    Tim 2 28
    Jack 3 42
    Ellen 2 31
    Shayna 2 18
    Antwaan 3 39
    ;
    RUN;

    PROC SORT data=vocabulary;
    BY grade;
    RUN;

    PROC MEANS data=vocabulary;
    VAR score;
    BY grade;
    RUN;

    PROC UNIVARIATE data=datasetname;
    VAR variablenames;
    BY variablename;
    HISTOGRAM variablenames;

PROC UNIVARIATE is very similar to PROC MEANS but produces far more detailed output, with medians, quartiles, etc. Use of the VAR and BY statements is just like for PROC MEANS. HISTOGRAM is optional. If you just list HISTOGRAM with some variablenames, SAS will choose its own intervals and label the bars using the midpoints of the intervals. I like to control the choice of intervals, and my preference is to label the left edges of the bars.
You can customize the choice of intervals the way you like by using the ENDPOINTS or MIDPOINTS options. Options come after the main part of the HISTOGRAM statement, following a / mark. For instance, suppose in the vocabulary score data I wanted histograms that used a class width of 5, starting at 15. All the data must lie between the first and last endpoints. Then I could say:

    PROC UNIVARIATE data=vocabulary;
    VAR score;
    HISTOGRAM score / endpoints=(15 to 45 by 5);
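
The MIDPOINTS option mentioned above works the same way, except that you list the centers of the bars rather than their edges. The sketch below is only an illustration, not part of the original handout: it re-creates the vocabulary dataset from the earlier example so that it can be run on its own, and it assumes that MIDPOINTS accepts the same parenthesized "to ... by" list used for ENDPOINTS. Bars of width 5 starting at 15 have centers at 17.5, 22.5, and so on up to 42.5, so this request should produce the same six bars, labeled at their centers instead of their left edges.

```sas
* Sketch only - rebuilds the vocabulary data from the earlier example;
DATA vocabulary;
INPUT student $ grade score;
CARDS;
Tim 2 28
Jack 3 42
Ellen 2 31
Shayna 2 18
Antwaan 3 39
;
RUN;

* Bars of width 5 centered at 17.5, 22.5, ..., 42.5 (covering 15 to 45);
PROC UNIVARIATE data=vocabulary;
VAR score;
HISTOGRAM score / midpoints=(17.5 to 42.5 by 5);
RUN;
```

As with ENDPOINTS, choose the list so that every data value falls inside one of the bars; here the scores run from 18 to 42, so the range 15 to 45 covers them all.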