Laboratory 1: Introduction to SAS

This session and the next are designed to give you hands-on instruction so that you develop a basic understanding of data handling in SAS (Statistical Analysis System), including reading data to create SAS datasets and running some basic procedures. We use the latest version of SAS, version 8.2.

Datasets for the laboratory sessions are available on the web. Please download them to a floppy disk and bring it (along with a blank floppy disk) to each laboratory session. Make copies of the data files on a second floppy disk in case the first one is corrupted or you have no access to the internet. Today you will work on the four case-control978.* files. The others are for future use.

Description of the Data

The data come from a case-control study of esophageal cancer among men. Each row (record) contains data for one subject. There are 978 rows containing information on 200 esophageal cancer cases and 778 population controls. There are 7 columns containing data on 7 variables.
The definitions of these 7 variables and a description of the variable codes are given below:

-----------------------------------------------------------------------
Column no   Variable Name   Range/Description
-----------------------------------------------------------------------
1           Record          Record number, 1-978
2           ID              1-1640    = control (n = 778)
                            6000-6206 = case (n = 200)
3           Age             Age in years (21-91)
4           Tobacco         Cigarette smoking
                            0 = never smoked
                            1 = 1-4 grams/day
                            2 = 5-9 grams/day
                            3 = 10-14 grams/day
                            4 = 15-19 grams/day
                            5 = 20-29 grams/day
                            6 = 30-39 grams/day
                            7 = 40+ grams/day
                            8 = did not answer
                            9 = unknown
5           Alcohol         Daily consumption in grams
6           Status          Case-control status
                            0 = controls
                            1 = cases
7           Sex             1 = male
-----------------------------------------------------------------------

Applied Epidemiologic Analysis - P8400 Lab1: Introduction to SAS Page 1 Henian Chen Fall 2002

The data set looks like this:

record   id     age   tobacco   alcohol   status   sex
1        1      42    0         139       0        1
2        2      45    2         66        0        1
3        3      35    0         24        0        1
.        .      .     .         .         .        .
.        .      .     .         .         .        .
.        .      .     .         .         .        .
977      6205   71    5         91        1        1
978      6206   40    2         141       1        1

We saved these data as:

1. case-control978.dat: “raw” data file (text file)
2. case-control978.txt: tab-delimited text file that includes the variable names
3. case-control978.wk3: Lotus 1-2-3 Release 3 spreadsheet
4. case-control978.dbf: dBASE III file

Five Windows of SAS

When SAS is started, five main windows open: the Editor, Log, Output, Results, and Explorer. The Editor, Log, and Explorer windows are visible. The Results window is hidden behind the Explorer window, and the Output window is hidden behind the Editor and Log windows. You can also use the function keys to switch windows: F7 brings you to the Output window, F5 to the Editor, and F6 to the Log.

1. Editor: The Editor window is for typing, editing, and running programs.
Some aspects of the Editor window will be familiar as standard features of Windows applications. The File menu allows programs to be read from a file, saved to a file, or printed. The File menu also contains the command to exit from SAS. The Edit menu contains the usual options for cutting, copying, and pasting text and for finding and replacing text. The Run menu is specific to the Editor window and is not available when another window is active. The program currently in the Editor window can be run by choosing the Submit option from the Run menu. To run only part of the program, select that text and then choose Submit from the Run menu. When a SAS program is run, two types of output are generated: the log and the procedure output. These are displayed in the Log and Output windows.

2. Log: The Log window shows the SAS statements that have been submitted together with information about the execution of the program, including warning and error messages. The contents of the Log window cannot be edited. The Clear All option in the Edit menu empties the window (Ctrl + E does the same).

3. Output: The Output window shows the printed results of any procedures. It is here that the results of any statistical analyses are shown. The contents of the Output window cannot be edited. The Clear All option in the Edit menu empties the window (Ctrl + E does the same).

4. Results: The Results window is a graphical index to the Output window, useful for navigating large amounts of procedure output. Right-clicking on a procedure, or on a section of output, allows that portion of the output to be viewed, printed, deleted, or saved to a file.

5. Explorer: The Explorer window allows the contents of SAS data sets and libraries to be examined interactively by double-clicking on them.

The windows can be managed with the normal Windows controls, including the Window menu. There is also a row of buttons and tabs at the bottom of the screen that can be used to select a window.

SAS Language

In SAS there are almost no point-and-click commands, so it is necessary to learn to write code.

The Editor uses a color-coded system. Most SAS statements begin with a keyword that identifies the type of statement. When SAS recognizes a keyword as it is typed, it changes the keyword's color to blue. If a word remains red, this indicates a problem: the word may have been mistyped or may be invalid for some other reason. This color coding is very helpful in understanding the syntax and in finding your errors.

A typical SAS program consists of data steps and procedure (proc) steps. A data step is used to prepare data for analysis. It creates a SAS data set and may reorganize and modify the data in the process. A proc step is used to perform a particular type of analysis, or statistical test, on the data in a SAS data set.

Data and proc steps begin with a data or proc statement, respectively, and end at the next data, proc, or run statement. When a data step has the data included within it, the step ends after the data. Understanding where steps begin and end is very important because SAS programs are executed in whole steps. If an incomplete step is submitted, it will not be executed. The statements that were submitted will be listed in the log, but SAS will appear to have stopped at that point without explanation. In fact, SAS is simply waiting for the step to be completed before running it. For this reason it is good practice to mark the end of each step explicitly with a run statement, and it is especially important to include one as the last statement in the program.

The Editor offers several visual indicators of the beginning and end of steps.
The data, proc, and run keywords are color-coded in navy blue, rather than the standard blue used for other keywords.

Global statements can be placed anywhere. If they are placed within a step, they apply to that step and all subsequent steps until reset. A simple example of a global statement is the title statement, which defines a title for procedure output. The title is then used until changed or reset.

Statements can extend over more than one line, and there may be more than one statement per line. However, keep to one statement per line as far as possible to avoid errors.

Names must be given to variables and data sets when writing a SAS program. Names can contain letters, numbers, and underscores, and can be up to 32 characters long, but they cannot begin with a number. Variable names can be in upper or lower case, or a mixture, but differences in case are ignored.

Don’t forget that all SAS statements must end with a semicolon. The most common mistake for new users is to omit the semicolon, which has the effect of combining two statements into one.

Data Step

Before data can be analysed in SAS, they need to be read into a SAS dataset. Creating a SAS data set for subsequent analysis is the primary function of the data step. A data step is also used to manipulate or reorganize the data. The data can be “raw” data or can come from a previously created SAS data set. In SAS, “raw” data means data that you type directly into the SAS program or that are stored in a text (ASCII) file. Such files include only the printable characters plus tabs, spaces, and end-of-line characters. Instructions follow for creating datasets by entering the data with a cards statement, reading a raw data file, using a previously created SAS dataset, and importing a dataset from an external data source.

1. Using the “cards” statement to input your data to SAS and create a SAS dataset

We have a small set of data about seniority, publication, and sex on 6 faculty members.
Instead of storing these data as a “raw” data file, it is easier in this case to include the data directly in the SAS program.

-----------------------------------------------------------------------
Time since Ph.D.     Sex     No. of Publications
-----------------------------------------------------------------------
6                    F       3
8                    M       17
9                    F       11
6                    M       6
10                   M       48
5                    F       30
-----------------------------------------------------------------------

Write a SAS program in the Editor window to create a SAS dataset (named “exercise”) by entering the data above.

    data exercise;
       input time sex $ publication;
       cards;
    6 f 3
    8 m 17
    9 f 11
    6 m 6
    10 m 48
    5 f 30
    ;
    run;

The data statement gives the SAS dataset a name: exercise.

The input statement in this example specifies three variables: time, sex, and publication. The dollar sign ($) after sex indicates that it is a character variable. SAS has only two types of variables: numeric and character.

The cards statement must follow all other statements in the data step. The data lines are followed by a line with just a semicolon. This line is called a null statement and identifies the end of the data.

Please run this program and check the Log window. If the program ran successfully, the Log window will show “The data set WORK.EXERCISE has 6 observations and 3 variables.”

To view the SAS dataset (exercise), use the PRINT procedure and look at the result in the Output window. Please run proc print as follows:

    proc print data=exercise;
    run;

Now you have a SAS dataset called exercise in the temporary “WORK” library. This is a temporary SAS dataset and will be lost when you quit SAS unless you save it as a permanent SAS dataset.

Using a “libname” statement to save your permanent SAS dataset.

    libname mine 'a:';
    data mine.myexercise;
       set exercise;
    run;

The libname statement links a library name of your choice (here, mine) to a directory (here, a:, your floppy disk).

The data statement gives the name you have chosen for this permanent SAS data set (myexercise).

The set statement gives the name of the temporary data set to copy from.

For the purposes of this course we will not require you to save any file permanently, because you don’t have access to the hard drive and your floppy disk may not have enough space for a big permanent SAS dataset. On your home computer, though, you are encouraged to use libnames and save files permanently.

Please save this program on your floppy disk as ‘a:exercise.sas’ using the ‘Save As...’ option in the ‘File’ menu.

The double trailing at-sign (@@): In the program exercise.sas, each data line holds just one record. The double trailing at-sign lets you put more than one record on a data line. For example, we can change exercise.sas to:

    data exercise;
       input time sex $ publication @@;
       cards;
    6 f 3 8 m 17 9 f 11 6 m 6 10 m 48 5 f 30
    ;
    run;

2. Using the “infile” statement to read a “raw” data file and create a SAS dataset

Write a SAS program in the Editor window to create a SAS dataset (named “case_control978”) from an existing text file (case-control978.dat).

    /* *****************************************************************************
     * Name:    Student's name
     * Date:    9/26/02
     * Program: a:case-control978.sas
     * Purpose: reading raw data to create SAS dataset using "infile" statement
     ***************************************************************************** */

    title1 '****************************************************************';
    title2 'Laboratory 1: Case-Control Study of Esophageal Cancer Among Men';
    title3 '****************************************************************';

    data case_control978;               /* create a SAS dataset called 'case_control978' */
       infile 'a:case-control978.dat';  /* indicate the location of the raw data */
       input record id age tobacco alcohol status sex;
    run;

    proc print data=case_control978;    /* check the data structure */
    run;

Please save this program on your floppy disk as ‘a:case-control978.sas’.

We started this program with a comment statement. Comment statements are global statements in the sense that they can occur anywhere. There are two forms of comment statement. The first form begins with an asterisk and ends with a semicolon. The second form begins with /* and ends with */.

The title statement is a global statement and provides a title that will appear on each page of printed output. The text of the title must be enclosed in quotes. Multiple title lines can be specified with the title2 statement for the second line, title3 for the third line, and so on up to ten.

The infile statement specifies the file where the raw data are stored. The infile statement must precede the input statement.

3. Using a previously created SAS dataset

To read data from a SAS data set, rather than from a raw data file, the set statement is used in place of the infile and input statements.

    /* Reading Data from an existing SAS Data Set */
    data new1;                      /* create a new SAS dataset called new1 */
       set 'a:myexercise';          /* read in the data from a:myexercise */
       female=0;                    /* create a new variable called female, let female=0 */
       if sex='f' then female=1;    /* let female=1 if sex is 'f' */
    run;

    proc print data=new1;           /* check the new1 structure */
    run;

    data new2;                      /* create a new SAS dataset called new2 */
       set new1;                    /* read in the data from new1 */
       drop sex;                    /* delete variable sex */
    run;

    proc print data=new2;           /* check the new2 structure */
    run;

IMPORT Procedure

In SAS, the files produced by database programs, spreadsheets, and word processors are not normally “raw” data. SAS cannot read those data files with the infile and input statements. Instead, we use the IMPORT procedure to read data from an external data source and write it to a SAS dataset. External data sources can include DBMS tables, PC files, spreadsheets, and delimited external files (in which columns of data values are separated by a delimiter such as a blank or a comma).

    /* import delimited file (tab-delimited values) to SAS */
    proc import datafile='a:case-control978.txt' out=delimited dbms=tab replace;
       getnames=yes;
    run;

    proc print data=delimited;
    run;

    /* import Lotus file to SAS */
    proc import datafile='a:case-control978.wk3' out=lotus dbms=wk3 replace;
       getnames=yes;
    run;

    proc print data=lotus;
    run;

    /* import dBASE file to SAS */
    proc import datafile='a:case-control978.dbf' out=dbase dbms=dbf replace;
    run;

    proc print data=dbase;
    run;

Today we did some typical SAS jobs that you will need throughout the semester for homework exercises. You should practice as much as possible with SAS this week to get comfortable with it. We will practice data management and proc steps in next week’s laboratory. Please bring back the dataset ‘case-control978.dat’ and the SAS program ‘case-control978.sas’.
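
For practice this week, here is a minimal sketch that puts several ideas from this session into one short program: both forms of comment statement, a global title statement, a data step that reads in-program data with the double trailing at-sign, and a proc step, with each step ended by a run statement. The dataset name practice and the title text are illustrative choices only and are not part of the lab files. The data values are the six faculty records used earlier.

```sas
* First comment form: begins with an asterisk and ends with a semicolon;
/* Second comment form: begins with slash-asterisk and ends with asterisk-slash */

title1 'Practice: faculty publications';   /* global statement: stays in effect until changed */

data practice;                             /* "practice" is a hypothetical dataset name */
   input time sex $ publication @@;        /* $ marks sex as character; @@ allows several records per line */
   cards;
6 f 3 8 m 17 9 f 11
6 m 6 10 m 48 5 f 30
;
run;

proc print data=practice;                  /* the title above appears on this output */
run;

title1;                                    /* a title statement with no text resets the title */
```

After submitting the program, check the Log window. If the step ran as intended, it should report that WORK.PRACTICE has 6 observations and 3 variables. Try deleting one semicolon and resubmitting to see how the log reports the problem.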