90-776 Manipulation of Large Data Sets
Lab 1
March 10, 1999

Major skills covered in today’s lab:
   Bringing data into SAS
   Exporting data from SAS
   Learning to program in SAS

I. Example Program

First, let’s examine the program l:\academic\90776\programs\distance.sas.

Note: As written, this program saves some files to the “thunderbolt” directory, to which you do not have write access. If you wish to run the program, you will have to change the file references.

/* Example Program for 90-776 */

/**** DISTANCE.SAS (l:\academic\90776\programs\distance.sas) is a program
      that performs some basic SAS commands. I use data from the example in
      class and on the top of page 4 of the handout. I have named this data
      example.txt and saved it as L:\academic\90776\data\text\example.txt ****/

/* The program will create l:\academic\90776\data\lab1.sd2 */

/* Note: it is always a good idea to include your name, the date, and
   important file locations at the top of your programs */

/* Created by: Rob Greenbaum */
/* Date: November 14, 1998 */
/* Last modified: 3/6/1999 */

/* We can set some options if we want to - here I set the page length,
   width, and page number */
OPTIONS ps = 80 ls = 65 pageno = 1;

/* I want to create an alias for the directory I will eventually save my
   data in */
/* LIBNAME sets up an alias for a DIRECTORY */
LIBNAME mydisk 'l:\academic\90776\data';

/* FILENAME sets up an alias for particular text files */
/* Next, I will give a name to the location of the existing ASCII data */
FILENAME extext 'l:\academic\90776\data\text\example.txt';

/* I also want to give a name to an ASCII file that I will create */
FILENAME lab1txt 'l:\academic\90776\data\text\lab1.txt';

/* Now I will tell SAS to create a temporary SAS data set called DIST */
/* Temporary data sets disappear when the current SAS session ends.
   DIST is temporary because I do not tell SAS to save the data to any drive */

/* I'll then put the ASCII data l:\academic\90776\data\text\example.txt
   into the temporary SAS data set DIST using INFILE and INPUT (see page 7
   of the handout) */
DATA dist;
   INFILE extext;   /* tells SAS to find l:\academic\90776\data\text\example.txt */
   INPUT name $ sex $ age distance;   /* tells SAS var names and order of vars */
RUN;

/* Note that character variable names must be followed by a "$" in the
   INPUT statement */

/* Let's see what variables SAS read in */
PROC CONTENTS data = dist;
RUN;

/* I want to make sure that SAS read in the data properly, so let's tell
   SAS to print out all of the data */
PROC PRINT data = dist;
RUN;

/* Let's find the mean of distance to work */
PROC MEANS data = dist;
   VAR distance;
RUN;

/* Let's create a new variable and save the new data set as both a
   permanent SAS data set and as a new ASCII data set */
/* Note, we can only create new variables inside of data steps, so we need
   a new data step */
DATA mydisk.lab1;   /* This will create l:\academic\90776\data\lab1.sd2 */
   SET dist;        /* This brings in the temporary SAS data set dist */
   FILE lab1txt;    /* Analogous to INFILE, except that I want to write the
                       file to l:\academic\90776\data\text\lab1.txt */
   age2 = age**2;   /* this creates an age squared variable */
   PUT name $ sex $ age distance age2;
RUN;

PROC CONTENTS data = mydisk.lab1;
RUN;   /* we always need to finish the program with a run statement */

II. Read in ASCII data from a file

Let’s read in the ASCII data set that the program DISTANCE.SAS created. Refer to the above program for the variable names. To read in ASCII data, we need to use the INFILE and INPUT commands.

1) Write the necessary code to read the ASCII file l:\academic\90776\data\text\lab1.txt into a temporary SAS data set.
2) Check your log file to make sure that you made no errors.
3) Use the CONTENTS and PRINT procedures to make sure you made no errors.
4) Use the MEANS procedure to find the mean of age and age2. (To perform PROC MEANS on only certain variables, we use the subcommand VAR.)

III. Read in SAS data from a file

Let’s read in the SAS data set that was created in the program DISTANCE.SAS. To read in an existing SAS data set, we use the SET command. We also need to use a LIBNAME statement to tell SAS what directory the data is in.

1) Write a SAS program to read the file l:\academic\90776\data\lab1.sd2 into a temporary SAS data set.
2) Check your log file to make sure that you made no errors.
3) Use the CONTENTS and MEANS procedures to make sure you made no errors.
4) Save your short program, log file, and output file to your disk.

IV. Enter ASCII data right in your program. Save SAS data.

To enter data right into your program, we can use the INPUT and DATALINES commands. See the lecture notes or the book for more information. To save a permanent SAS data set, you must use a LIBNAME statement and a two-part name for the file.

1) Write the code to bring the 1999 Minnesota Timberwolves home attendance data into a permanent SAS data set. Save this data somewhere on your own disk.

DATE   ATTENDANCE   TOTAL   AVERAGE
1      16422        16422   16422
2      19006        35428   17714
3      18151        53579   17860
4      17907        71486   17872

(Did you remember that variable names cannot exceed eight characters?)

2) Use the PRINT procedure to check your work.
3) Check your directory to confirm that you saved the SAS data set.

V. Enter ASCII data with missing observations. Save the data as ASCII data.

When records have missing values at the end of the record, you need to include the option MISSOVER in the INFILE statement. This option tells SAS not to go to the next record to look for the missing values; instead, SAS sets the remaining variables to missing and continues with the next record as usual. SAS displays a missing numeric value as a period.

The Minnesota Timberwolves data now includes information for the first eight games of the season. This data is stored as l:\academic\90776\data\text\twolves.txt.
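The effect of MISSOVER can be seen on a small made-up example. The sketch below is only an illustration: the data set name MISSDEMO, the variable names a, b, and c, and the three data lines are hypothetical and are not part of the lab data.

```sas
/* Sketch only: MISSDEMO, a, b, and c are hypothetical names,
   and the data lines are made up for illustration. */
DATA missdemo;
   INFILE DATALINES MISSOVER;   /* keep each record on its own observation */
   INPUT a b c;
   DATALINES;
1 2 3
4 5
6 7 8
;
RUN;

PROC PRINT DATA = missdemo;
RUN;
```

With MISSOVER, the second observation should show c as a period. If you remove the word MISSOVER and rerun the step, SAS should instead take the value of c for the second observation from the third line, so the data set ends up with fewer observations than there are data lines.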
1) Read this ASCII data set into a temporary data set. The 4 variables are the same as before.
2) Use the PRINT procedure to look at the data. The data is reproduced in the table below. Does your data look like the table below? Why not?
3) Fix your INFILE command so that it contains the word MISSOVER at the end of the command (see p. 303 of the text for more help).
4) Create a new average attendance variable and call it AVG2. Hint: the average attendance is just the total attendance divided by the number of games played.
5) Save this new data set as an ASCII file on your own disk (you will need to use the FILE and PUT commands; see pp. 304-305 of the text for more help).
6) Use the PRINT procedure to print out only the AVERAGE and AVG2 variables. Next week we will learn how to drop extra variables such as the AVERAGE variable.

1999 Minnesota Timberwolves Home Attendance

DATE   ATTENDANCE   TOTAL    AVERAGE
1      16422        16422    16422
2      19006        35428    17714
3      18151        53579    17860
4      17907        71486    17872
5      16848        88334
6      15374        103708   17285
7      16219        119927
8      14776        134703   16838
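
Steps 4 and 6 combine two patterns: creating a new variable inside a DATA step, and limiting PROC PRINT to certain variables with VAR. The sketch below shows both patterns on made-up data. The data set name SCORES and the variables games, points, and ppg are hypothetical, chosen so that the sketch does not give away the lab answer.

```sas
/* Sketch only: SCORES, games, points, and ppg are hypothetical names,
   and the data lines are made up for illustration. */
DATA scores;
   INPUT games points;
   ppg = points / games;   /* new variables must be created inside a DATA step */
   DATALINES;
1 20
2 44
3 63
;
RUN;

PROC PRINT DATA = scores;
   VAR points ppg;         /* print only the listed variables */
RUN;
```

The VAR statement works in PROC PRINT the same way it does in PROC MEANS: only the variables you list appear in the output.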