SAS Programming Notes
For Data Mining and Exploration
Lecturer: Amos Storkey
School of Informatics, University of Edinburgh

Acknowledgements: These notes are extensively based on notes developed over a long period by the School of Accounting, Economics & Statistics, Napier University. People who have worked on or contributed to these notes over that time include Amos Storkey, Ana Costa Da Silva, Phil Darby, Helen Storkey, Jeff Dodgson, Dorothy Currie, Kate Houston and Kirsty Davidson. I am very grateful for permission to use and develop these notes for the Data Mining and Exploration course.

First published September 2000
Updated September 2001 (SAS version 8) and February, July, September 2002
October 2004 (SAS version 8.1) and September 2005
December 2006 (SAS version 9.1.3)
January 2008 (SAS 9.2 and Linux differences)
File: SASv9.2.doc

Contents

1. GETTING STARTED
   1.1 What is the SAS system?
   1.2 The SAS Workspace
   1.3 Creating and running a SAS program
   1.4 Submitting and correcting your program
   1.5 Saving files and clearing text from windows
   1.6 Reading a saved program
   1.7 A Data Analysis Flow Chart
   1.8 Importing data using a wizard
   1.9 Viewing a data set
   1.10 Creating a SAS Program
   1.11 Rules for entering SAS statements
   1.12 Adding comments to a program
   1.13 Including titles in your SAS output
   1.14 Creating new variables
   1.15 Printing and saving SAS output
2. DATA FILES AND SAS DATA SETS
   2.1 Reading data files using the INFILE statement
   2.2 LIBNAME and permanent SAS Data Sets
   2.3 Referencing a permanent SAS data set
   2.4 Contents of a file
   2.5 Importing data from other packages
   2.6 Missing values
   2.7 The INPUT statement
3. SAS PROCEDURES
   3.1 Structure of a SAS program
   3.2 Sample program
4.
SUMMARISING DATA
   4.1 SAS System Options
   4.2 HTML output
   4.3 Summary Procedures
   4.4 PROC SORT
   4.5 PROC MEANS
   4.6 PROC UNIVARIATE
   4.7 PROC FREQ
   4.8 General syntax for a procedure
   4.9 Help
5. GRAPHS AND CHARTS
   5.1 Graphics procedures
   5.2 PROC PLOT
   5.3 PROC CHART
6. CORRELATION AND REGRESSION
   6.1 PROC CORR
   6.2 PROC REG
7. EXPLORATORY DATA ANALYSIS
   7.1 SAS/INSIGHT
   7.2 Accessing SAS/INSIGHT
   7.3 Features of SAS/INSIGHT
   7.4 Using SAS/INSIGHT Tools
8. MODIFYING DATA AND OUTPUT
   8.1 Introduction
   8.2 SET statement
   8.3 DROP and KEEP
   8.4 Labelling output
   8.5 PROC PRINT
   8.6 PROC FORMAT
   8.7 Recoding data
   8.8 Conditional statements
   8.9 VALUE statement
   8.10 OUTPUT
9. PROC TABULATE
10. FUNCTIONS AND FORMATS
   10.1 MEAN function
   10.2 NMISS function
   10.3 N function
   10.4 Functions to handle character variables
   10.5 Date and Time Formats
11. ITERATIVE PROCESSING
   11.1 Do loops and arrays
   11.2 Reading data in repeated patterns
   11.3 Arrays
   11.4 Generating random numbers
   11.5 Random numbers from a uniform distribution
   11.6 Random numbers from a normal distribution
   11.7 The SAS Program Data Vector
   11.8 The RETAIN and Sum statements
12. FURTHER TOPICS
   12.1 Combining Data Sets
   12.2 Hints on Using Word with SAS and SAS/INSIGHT
SOLUTIONS TO EXERCISES

Various files are referred to in these notes. These can be found in a zip file on the Data Mining and Exploration web site www.inf.ed.ac.uk/teaching/courses/dme/

1. Getting started

1.1. Introduction

The SAS system is a widely used resource for statistical analysis and data mining. It is rare to find a job advert for a data mining practitioner that does not ask for SAS skills.
The main positive points of SAS are its ability to handle large files fairly transparently, the ease and comprehensiveness with which standard analyses can be done, the way analyses can be built interactively alongside a systematic programming environment, and its data handling capabilities. Its main negative points are its graphical capabilities, and the fact that adding your own extensions to the techniques, using macros and the Interactive Matrix Language, is slightly more cumbersome than in other languages (e.g. MATLAB, R) and than with more modern language constructs.

This tutorial will introduce you to the SAS System. It should be suitable for those working on either a Linux or a Windows system. The interface tools in SAS for Windows are much better, so where there are differences these will also be mentioned.

SAS is, at its heart, a piece of software for data handling and storage, statistical and data analysis, data mining, decision support and report writing. It has been extended to a whole business intelligence package, but the best way of understanding SAS is from the inside out, and so this tutorial will teach the base SAS software to get you started.

With base SAS software you can store data values and retrieve them, modify data, compute simple statistics, and create reports, all in one SAS session. The difference between SAS and most statistical packages is that SAS incorporates both a database management system and a high-level programming language.

There is also SAS software which provides graphics, forecasting, data entry, and statistics. The SAS system also contains other sophisticated applications that are valuable to large enterprises. All are available in one system.

1.2. The SAS Workspace

To start SAS on a Linux system type SAS at the command prompt. On Windows, select SAS from the Start menu.

When you go into SAS, the first thing you see is a set of windows as shown in Figure 1.
Your display may appear a little different since this has been adjusted to allow all the windows to be seen at once. There are five different windows shown in this figure. Two further windows are available in SAS version 9; you can switch between them by clicking on the buttons at the bottom of the SAS window.

[Figure 1: SAS window on opening in Windows. Labelled parts: Run, Libraries, Editor, Output, Explorer, Results, Log]

[Figure 2: SAS window on opening in Linux. Labelled parts: Explorer, Libraries, Log, Editor, Output]

The five windows are:
- the EDITOR window, where you enter the SAS statements you wish to execute. The EDITOR has handy features like colour coding and expandable and collapsible sections.
- the LOG window, which contains information on your SAS run, e.g. date and time of run, a listing of your SAS statements as they are executed and any errors which have occurred during processing.
- the OUTPUT window, which displays the actual results of the program.
- the EXPLORER window, which allows you to view and manage your SAS files and create shortcuts to non-SAS files. For example, you can use this window to create new libraries or to open any SAS file.
- the RESULTS window, which helps you navigate and manage output from SAS programs you submit. You can view, save, and print individual items of output. (By default, the Results window is positioned behind the Explorer window, but when you submit a SAS program that creates output it moves to the front of your display.)

The two windows not shown are:
- the GRAPH window, which will appear when graphical output is to be displayed.
- a seventh window, which will appear when HTML output is used. The output delivery system (ODS) can be turned on using programming code or by using the menu options.

You may turn a window on or off by using View from the main menu. Just choose the window you need (use this if you 'lose' a window).
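As a minimal sketch of turning the output delivery system on from program code rather than from the menus, the statements below open the HTML destination, run one procedure and close the destination again. The data set sashelp.class is a sample table shipped with SAS and is used here only as an assumed stand-in for your own data.

```sas
/* Assumption: sashelp.class (a sample data set supplied with SAS) stands in for your own data */
ods html;                         /* open the HTML destination */

proc print data=sashelp.class;    /* any procedure output now also goes to HTML */
run;

ods html close;                   /* close the HTML destination again */
```

Output produced between the two ODS statements should appear in the HTML window as well as in the usual OUTPUT window.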
Task 1
Resize the 3 windows on the right hand side so that you see the OUTPUT as well as the EDITOR and the LOG. Make the EDITOR the largest window.

You can activate any of the windows by
- clicking on the window (Windows or Linux)
- selecting Window from the menu, then the window you want (Windows)
- selecting View from the menu (Windows or Linux)

1.3. Creating and running a SAS program

The following lines of code are a simple SAS program. When they are typed into the editor window the words will become colour coded.
- Reserved words appear blue (e.g. proc, print, input).
- Comments appear green in Windows and black in Linux (see Section 1.12 for details of entering comments).
- Errors appear red.

   data class1;
   input height weight sex $;
   datalines;
   152 45.4 F
   178 73.0 M
   178 68.8 M
   175 59.7 M
   157 44.5 F
   165 61.7 M
   175 74.1 M
   160 49.5 F
   run;
   proc print;
   run;

Task 2
Enter the SAS program in the EDITOR window.

1.4. Submitting and correcting your program

There are several methods of submitting your program.
1. Highlight the section of code you wish to run and press the running man icon (in Windows).
2. Ensure that your cursor is in the EDITOR window, then select Run -> Submit (in Windows or Linux).
3. You can also run just a few lines of code by selecting Run -> Submit top line or Submit N lines (in Windows or Linux).
4. Right click with the mouse and select Submit All, or press the running man icon.

An alternative to pressing the running man icon is to press the key F3 in Windows or the key End in Linux.

Examine your LOG window to check that there were no error messages: if all is well, examine your output in the OUTPUT window.

If you have error messages in your log you will need to correct the mistakes and resubmit the program. After submitting your code you may find that it has disappeared from the editor window. To overcome this problem select Run -> Recall Last Submit.

1.5.
Saving files and clearing text from windows

When you have succeeded in getting your program to run you can save it as filename.sas (SAS automatically gives it a .sas ending to remind you that it is a SAS program). Make sure your EDITOR window is active before doing File -> Save (or pressing the floppy disk icon). Otherwise you might be saving the contents of your log or output window instead of your program.

Save log files as filename.log and output files as filename.lst if you want to save them too. It is usually not necessary to save the log file.

Important: In order to avoid getting confused about which output and log refer to which program or version of a program, make it a habit to clear your windows before submitting a new program. Do this by selecting Edit -> Clear All.

Run -> Recall Last Submit returns the program you have just run to the EDITOR window. This is useful if you have cleared the program by mistake.

Task 3
Create a new folder in your personal disk space called MA71064 Statistical computing. Submit the SAS program from the program editor. When it is working satisfactorily save the file as class1.sas in the folder you have just created.

1.6. Reading a saved program

A SAS program needs to be in an EDITOR window before it can run. To open a saved SAS program, activate the EDITOR window and use File -> Open. The program can then be submitted in the usual way.

You can have more than one EDITOR window open at the same time. However, this can be confusing and it is easiest at first to have only one program open at a time.

1.7. A Data Analysis Flow Chart

Data analysis can be thought of in terms of a process flow. Actions proceed in a sequence. Often the output from one action leads to the input of another. A simple flow chart is given below.
   Start
     -> Read or create a data set   (DATA step)
     -> Display the data            (PROC PRINT)
     -> Calculate the averages      (PROC MEANS)
     -> End

Figure 3 A simple data analysis flow diagram

SAS programs can contain combinations of DATA steps and PROCEDURES. The SAS program you used above executed the first 2 blocks in the flow diagram. Quite quickly you will be producing more complicated programs that will have many DATA steps and PROCS.

1.8. Importing data using a wizard

The next example reads the Excel file Class0 into the temporary SAS library called Work. The format of the data is displayed and the summary statistics (count, average and standard deviation) of the height readings are calculated.

The simplest method of entering data into SAS is using the import wizard. File -> Import Data will display the dialogue shown in Figure 4.
A. The source type default is Excel but others are available from the pull down menu. Press Next.
B. Locate the source file by pressing the Browse button. Press OK.
C. Select the appropriate worksheet. From the options ensure that 'Use data in the first row as SAS names' is ticked. Press OK, then Next.
D. Enter the Member as Class0. Press Finish.
Check the log window for errors.

[Figure 4: Import data wizard]

In Linux there is naturally no option of importing from Excel. However, there is the option of importing csv files. An Excel file can be opened using Open Office and saved as a comma separated file (csv). It can then be imported straight into SAS.

In Linux, steps B and C above are replaced by the dialogue in Figure 5. Similar options are available when pressing the respective button. The remainder of the dialogue is similar to that in Figure 4.

[Figure 5: Import data wizard in Linux]

The final step of the wizard, in both Linux and Windows, is optional and offers the possibility of saving the importation command in a specified file, which can be opened with the Program Editor.
This can be copied and pasted into any program and run, without the need to follow the steps of the wizard again.

1.9. Viewing a data set

Once the data is in the SAS format you can look at it in a variety of ways.
1 Submit
      proc print;
      run;
2 From the explorer window, double click on the libraries icon to reveal the libraries that are present. These libraries are simply pointers to Windows XP folders where the data sets are stored. Double clicking on the work library reveals the data set Class0.
3 Double click on the data set to open it.
4 Right click on the data set to display a set of options. These include Open, View Columns and View in Excel (only in Windows).

Task 4
Import the Excel file class0.xls into SAS using the import wizard, then display the imported data set using Excel.

1.10. Creating a SAS Program

You have already submitted a simple SAS program which created and then printed out a set of data. The following is an extension of that program. The line numbers have been included to help explain the structure of the program: they are not part of the program itself and should not be typed.

   Line number   Program
   001           data class2;
   002           input height weight sex $ bends pulse1 pulse2;
   003           datalines;
   004           152 45.4 F  6 61  84
   005           178 73.0 M  8 59 102
   006           178 68.8 M 12 58  95
   007           175 59.7 M  5 76  83
   008           157 44.5 F  5 53 102
   009           165 61.7 M 10 70 110
   010           175 74.1 M  5 76 102
   011           160 49.5 F  2 67 118
   012           161 52.6 M  5 80 103
   013           180 85.4 M  7 84 102
   014           160 57.2 F  7 98 115
   015           170 69.9 M  7 69 102
   016           178 67.0 M 11 60  79
   017           163 57.0 F  8 70  98
   018           160 60.9 F 12 57  84
   019           185 73.1 M  5 68   .
   020           188 79.1 M  3 53  69
   021           159 49.5 F  6 69 112
   022           run;
   023           proc print;
   024           run;
   025           proc means;
   026           run;

   Line number   Explanation
   001           The DATA statement tells SAS to create a data set called class2.
   002           The INPUT statement names the variables in the order they appear in the data lines.
                 Variable names must start with a letter or an underscore, be no more than 32 characters in length (eight characters in version 6) and must not contain blanks, commas and so on. To read data as characters, rather than numbers, a dollar sign is put after the variable name.
   003           The DATALINES statement indicates that the next lines are data.
   004 to 021    The data are entered with one or more spaces separating each item. The data must be in the same order as declared in the INPUT statement. A new line is used for each record.
   019           A full stop indicates a missing numerical value.
   022           RUN tells SAS to execute the preceding statements.
   023           PROC PRINT is a procedure to print data in the Output Window.
   025 & 026     PROC MEANS is a procedure to calculate the mean and other statistics of all the numeric variables; RUN completes the procedure.

This example illustrates the basic structure of a SAS program:
- A DATA step consists of a DATA statement and other statements that form part of this step.
- SAS procedures begin with a PROC statement. The PROC statement may also be followed by statements that are part of the procedure step, although there are none in these two examples.

1.11. Rules for entering SAS statements

SAS statements:
- usually begin with an identifying keyword
- always end with a semicolon (check carefully before you submit any program!)
- can be in uppercase or lowercase letters

SAS statements are free format:
- they can begin and end in any column
- one statement can continue over several lines
- several statements can be on one line

Readability is improved if you add comments and leave spaces between the DATA and PROC steps, and perhaps also indent code within a DATA or PROC step. Develop your own style and stick with it.

1.12. Adding comments to a program

There are two ways of writing comments in a SAS program:
- begin the comment line with an asterisk and end with a semi-colon, e.g.
      *This program was developed by J Smith;
- begin with a forward slash asterisk and end with an asterisk forward slash, e.g.
      /* J Smith February 2005 */

Inserting comments is essential if you are doing any serious programming. The /* style */ is also useful for 'commenting out' blocks of a program when testing or debugging.

Task 5
Read the file class1.sas into the program editor. Edit the program so that it is the same as the sample program in Section 1.10 but with the addition of a comment which gives your name and today's date. Submit the program and when it is working properly save it as class2.sas. What information did the PROC MEANS procedure give you?

1.13. Including titles in your SAS output

The TITLE statement is used to provide titles on your output. The TITLE statement can appear anywhere in a program (it is an example of a global statement) and subsequently each page of output (and each graph) will have the title until it is reset. For example, program class2.sas could be enhanced as follows:

   ...
   proc print;
   title 'Information on Students in Class';
   run;
   proc means;
   title 'Summary Statistics of Students in Class';
   run;

However, the final title will adorn all future output until it is reset with another title or 'cancelled' with

   title;
   run;

Task 6
Experiment with the TITLE statement in program class2.sas.

1.14. Creating new variables

If you need to analyse variables that are derived from the input variables, then you must create these variables in the DATA step. For example, if you want to use two new variables, 'the difference in pulse rates' and 'the log of the number of bends', then these variables must be defined in the DATA step, before the DATALINES statement and the lines of data. The rules about naming new variables are the same as for input variables.

   data class2;
   input height weight sex $ bends pulse1 pulse2;
   diff=pulse2-pulse1;
   lnbends=log(bends);
   datalines;
   ...
   run;

Some commonly used operators and functions are as follows:

   Operator   Meaning            Function   Meaning
   *          multiplication     log( )     natural log
   /          division           exp( )     exponential
   **         exponentiation     sqrt( )    square root

1.15. Printing and saving SAS output

The contents of the OUTPUT window may be sent to a printer using the OUTPUT window print command. You can change the way the output looks using, for example, the LINESIZE and PAGESIZE options (see Section 4.1 or SAS Help). The whole of the OUTPUT window listing may be saved as filename.lst using the OUTPUT window save command.

In Windows, it is often convenient to copy all or part of the OUTPUT window into a Word document or other text processing software. This can be achieved with the copy and paste operation. However, the results might be disappointing. The appearance may be improved by
- using a fixed space font such as SAS Monospace (available if SAS is running)
- avoiding 'wrap around' by reducing the font size and avoiding unnecessary leading spaces.

See Section 12.2 for further advice on incorporating SAS numeric and graphical output into a Word document.

Task 7 (WINDOWS ONLY)
Create a Word document (using copy and paste) which consists of program class2.sas and the output it produces. Experiment with improving the layout of the document.

Exercises

1.1 The following table shows the heights and weights of 16 eleven-year-old girls.

   ID no   Height(cm)   Weight(kg)      ID no   Height(cm)   Weight(kg)
   59      135          25              71      133          30
   82      146          33              78      149          35
   27      153          56              12      141          33
   52      154          51              37      164          48
   55      139          31              28      146          37
   13      131          25              48      149          45
   01      149          43              69      147          36
   15      137          32              16      152          47

(a) Write the SAS statements to create a SAS data set called ELEVEN. The ID number should be stored as a character variable.
(b) Insert a comment statement to indicate that this data came from Exercise 1.1.
(c) Create a new variable which is the ratio of weight to height.
(d) Produce a printout of the data and a table which shows the mean, standard deviation, the maximum and the minimum values for the variables height, weight and the ratio of weight to height. The output should be suitably labelled.
(e) (WINDOWS ONLY) Copy and paste your program and its output into a Word document (edit to ensure an attractive appearance).
(f) How would the output have differed if you had input the ID number as a numeric variable?

1.2 Modify your program in 1.1 above in order to determine the body mass index (BMI = weight in kilograms / (height in metres)^2, or BMI = W/(H * H)).

1.3 Several measurements of water quality were taken at eight different sites along the Firth of Forth. The data are shown below.

   Site   Salinity   Phosphate   Nitrogen   Chlorophyll   Faecal Coliforms
   CR     30.11      0.068       0.297      1.693         2.917
   WG     31.48      0.059       0.165      1.464         3.149
   EG     31.79      0.068       0.144      1.100         3.196
   SF     31.37      0.185       0.278      1.787         3.418
   PB     31.50      0.116       0.223      2.099         3.049
   JO     31.60      0.106       0.207      1.067         2.903
   SS     30.50      0.047       0.162      1.563         2.895
   FN     31.96      0.060       0.130      0.753         2.797

(a) Write the statements to create a SAS data set called FORTH.
(b) The units of phosphate are mg/litre. Create a new variable which gives phosphate in units of µg/litre, where 1 mg = 1000 µg (1 milligram = 1,000 micrograms).
(c) Produce a table which shows summary values for each variable.
(d) Save the program in a file called forth.sas.

2. Data files and SAS data sets

2.1. Reading data files using the INFILE statement

In the examples in the previous section you created temporary SAS data sets from data which were included in the program, with a DATALINES statement. In practice, a large set of data is more likely to be available as a raw data file (known as an ASCII or text file) and it will be more convenient to read the external data directly into SAS.
To illustrate this we will create a small ASCII data set using the Notepad editor, read the data file into SAS and then print the contents.

Task 1
Open Notepad (or another text editor) and type in the following data set. Save it as blood.txt on your floppy disk. (Note that Notepad automatically gives the extension .txt.)

    1 107 100
    2 110 114
    3 123 105
    4 129 112
    5 112 115
    6 111 116
    7 107 106
    8 112 102
    9 136 125
   10 102 104

The variables in the data set are patient number and blood pressure measurements before and after treatment. The code required to input the data into SAS and get a printout in the Output Window is as follows.

   data blood;
   infile 'a:\blood.txt';
   input patient $ before after;
   run;
   proc print;
   run;

The only changes that are required to the previous method of data input are that:
- the INPUT statement is preceded by an INFILE statement to tell SAS where to find the external data file.
- the DATALINES statement and the lines of data are omitted.

Task 2
Type the above SAS program into the EDITOR window and save the program under a suitable name. Submit the program and confirm that the values of the variables together with variable names have been printed in the OUTPUT window.

You can verify that the data is stored in the correct location on your hard drive. An example is given in Figure 6.

[Figure 6: The data set Blood in the SAS Work library and temporary directory]

2.2. LIBNAME and permanent SAS Data Sets

In the programs you have written so far the data set used in any analysis has been created in the data step. Such a data set is described as temporary in the sense that it only exists during your current SAS session and will be deleted when you close SAS. This kind of temporary file is stored in a SAS library called WORK. You can check what files you have created in the current session by going to the EXPLORER window and clicking on Libraries and then on the library WORK.
(Use View -> Up One Level or View -> Show Tree to navigate back to the original EXPLORER window.)

SAS files are given a two part name. The first part of the name is the name of the library in which the file is stored and the second part is the name of the particular file. You probably noticed that when you created the previous data sets, for example class2, SAS referred to this file in the log window as WORK.CLASS2.

If you wanted to do further analyses on this type of data set in a different session you would need to recreate the data set by running the data step once again. This can be a time-consuming process, especially if you have a large amount of data and have created many new variables, changed the format of variables and so on.

The alternative approach is to create a permanent SAS data set. This is a special type of file, unique to SAS, which stores the data, variable names and other information such as formats. You can set up a library to store your data sets and save them so that they can be used in another SAS session.

The SAS LIBNAME statement defines the name of the library where the file is to be stored. For example, if you want to store your data on your own disk in drive A, then you need to give this a SAS libname using a statement like the first one below. The actual name of the library, in this case myadisk, is chosen by the programmer and is just a convenient name that can be referred to later in the program. A library name can be at most eight characters long.

   LIBNAME myadisk 'a:\';
   LIBNAME mydisk 'c:\My documents\Napier\MA71064 Stat Comp\';
   LIBNAME myhome 'h:\MA71064 Statistical computing\';

You can then save your SAS data sets in this library using a two level SAS name. The first part of the name is the libname and the second part is the name given to the SAS data set. So to create the permanent SAS data set called class2, on the H: drive, would require the following SAS code.
   libname mydisk 'h:\MA71064 Statistical computing\';
   * The data library called mydisk will;
   * be located on the H: drive;
   data mydisk.class2;   *Create the new data set class2;
   input height weight sex $ bends pulse1 pulse2;
   datalines;
   152 45.4 F 6 61 84
   178 53.0 M 8 59 102
   165 61.7 M 10 70 110
   ...................
   175 74.1 M 5 76 102
   160 49.5 F 2 67 118
   run;

Task 3
Modify your program, class2.sas, to create a permanent data set. (If you want the data set stored on a hard drive make sure you give the full path name of the required directory.) Check the messages in the LOG window and check you can see the permanent data set in the EXPLORER window.

SAS version 8 puts an automatic SAS7BDAT extension on permanent data sets (version 6 uses SD2).

Task 4
Go to Windows Explorer and check that you have a file class2.SAS7BDAT in the appropriate directory.

2.3. Referencing a permanent SAS data set

Suppose that you have a permanent SAS data set stored in a particular directory. You may have created this yourself or possibly have downloaded it from the web. You may carry out procedures on the data set directly by using the DATA option in the procedure statement. All the details of variable names and so on will be held in the data set.

In the following example the permanent SAS data set prac1 is stored in directory h:\sas\sasdata.

   libname xyz 'h:\sas\sasdata';
   proc print data = xyz.prac1;
   proc means data = xyz.prac1;
   run;

Note that the first part of the name, given by LIBNAME, is a pointer to a directory and does not have to be the same name as was used when the data set was created. It is the second part of the name that refers to the particular data set.

2.4. Contents of a file

You can use the SAS procedure CONTENTS to get information about a data set and a list of the variables it contains.
This procedure is useful for larger data sets that would be too long or have too many variables to list completely, and it gives you information about when and where the data set was last modified. For example,

   proc contents data = xyz.prac1;
   run;

Note: if you have already submitted a LIBNAME statement in the current SAS session, it is not necessary to do so again. You can simply refer to the two-level data set name.

Task 5
Get information about the SAS data set COMPANY stored in SASHELP. What information is given in each of the columns?

An alternative way of inspecting what variables are in a large file is to print out only the first few observations. This can be done using an option in the PROC PRINT statement.

   proc print data=sashelp.company (obs=6);
   run;

Remember you can also view data sets from the EXPLORER window.

2.5. Importing data from other packages

Software such as Excel, Minitab and SPSS stores data in file types unique to itself. Some packages have the ability to export to, or import from, other formats. Use the import wizard or PROC IMPORT (see SAS Help).

A safe approach to importing data from such application software into SAS is to export from the other package into ASCII format and input the resulting file into SAS in a DATA step. Large data sets from outside sources (other companies or organisations) are usually supplied in ASCII format since such data is often held in proprietary databases. Most software allows data to be written to an ASCII or raw data file. This approach is illustrated below:

   Application data file  --(export)-->  ASCII data file  --(INFILE statement)-->  SAS data set

It is a good idea to check the ASCII file with an editor such as WordPad (and possibly 'tidy up' if necessary). The ASCII file can then be read into SAS using the INFILE statement, assigning variable names with the INPUT statement (as explained in Section 2.1).
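The two routes described above can be sketched as follows. This is only an illustration under stated assumptions: the file names class0.txt and class0.csv, the output data set names and the variable list height weight sex are hypothetical and should be replaced by those of your own exported file.

```sas
/* Hypothetical file names and variables: replace with those of your own exported file */

/* (1) DATA step route: space-delimited ASCII file with one line of column headings */
data work.class0;
   infile 'class0.txt' firstobs=2;    /* firstobs=2 starts reading at line 2, skipping the headings */
   input height weight sex $;         /* assumed variable list, in the order of the columns */
run;

/* (2) PROC IMPORT route: comma separated file */
proc import datafile='class0.csv'
            out=work.class0csv
            dbms=csv
            replace;                  /* overwrite work.class0csv if it already exists */
   getnames=yes;                      /* take variable names from the first row */
run;
```

The DATA step route gives full control over variable names and types, while PROC IMPORT guesses them from the file; in either case it is worth checking the LOG window and running PROC CONTENTS on the result.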
Advice on importing data into SAS from popular applications software is summarised below.

Excel (WINDOWS)
For a spreadsheet containing data only (values of variables in columns):
- Right align columns (necessary for character data).
- File -> Save As -> Formatted Text (Space delimited) to give the ASCII file filename.prn.
- Edit filename.prn with WordPad if column headings need deleting, missing values need replacing with '.' etc.

Minitab
- File -> Other Files -> Export Special Text
- Specify columns (accept Period Decimal Separator). This results in the ASCII file filename.dat.
Note that Minitab's missing value symbol is '*'. SAS will find this invalid and replace it by '.'.

SPSS
- File -> Other Files -> Fixed ASCII
This results in filename.dat. Note that SPSS's missing value symbol is '.'. However, this will be blank in filename.dat and cause SAS to misread the data set when using simple list input.
Alternative: SPSS allows data to be saved directly as a permanent SAS data set:
- File -> Save As -> SASv7 Windows long extension
In recent versions of SAS (e.g. 9.2), SPSS files can be imported directly.

2.6. Missing values

Uncoded missing values present special problems for list input. To provide some protection for the integrity of your output data set when input data contain uncoded missing values, use the MISSOVER or STOPOVER options in the INFILE statement.

Use the MISSOVER option to set all remaining variables in the INPUT statement to missing when the input line runs out of values.

Use the STOPOVER option to prevent an observation from being written to the data set when the input line does not contain a value for each variable in the INPUT statement, and to stop the DATA step from further processing.

For example,
the program

data test1;
    input id $ var1 var2 var3 var4 var5;
    datalines;
1001 115 45 65 83 78
1002 86 27 55 86
1004 93 52 63 76 88
1015 73 35 43 112 108
;
run;

would result in the following inaccurate data set

obs   id     var1   var2   var3   var4   var5
 1    1001    115     45     65     83     78
 2    1002     86     27     55     86   1004
 3    1015     73     35     43    112    108

Because the second data line holds only four numeric values, SAS moves on to the next line to find a value for var5, reads 1004, and then discards the rest of that line.

If we use the MISSOVER option, i.e.

data test1;
    infile cards missover;
    input id $ var1 var2 var3 var4 var5;
    cards;
1001 115 45 65 83 78
1002 86 27 55 86
1004 93 52 63 76 88
1015 73 35 43 112 108
;
run;

(CARDS is an older alias for DATALINES), we will get the following data set

obs   id     var1   var2   var3   var4   var5
 1    1001    115     45     65     83     78
 2    1002     86     27     55     86      .
 3    1004     93     52     63     76     88
 4    1015     73     35     43    112    108

Using the MISSOVER option prevents the uncoded missing value in the second data line from causing the third record to be read incorrectly as well. The second observation may still be incorrect, because list input cannot tell which value was omitted: the values present are assigned to the variables in order and the last variable is set to missing. The errors have, however, been restricted to one observation. The STOPOVER option would instead stop the DATA step with an error at the second data line, so neither observation 2 nor any later observation would be written to the data set. In order to read the data in properly, either column input or formatted input would have to be used (see next section).

2.7. The INPUT statement

The INPUT statement names the variables being read in via a DATALINES or INFILE statement and tells SAS where on the DATALINES, or on the lines of the INFILE, the values of the variables can be found.

There are three main types of INPUT that you can use to describe a record's values: LIST, COLUMN and FORMATTED. The choice of which type of input you use will depend on the type and arrangement of the incoming data. The $ symbol is placed after a variable name to indicate a character variable.

In the previous examples you have used only the simplest type of INPUT, LIST INPUT. List input is seldom useful for large commercial or scientific work because uncoded missing values and other errors in big files are too easily misread.
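The point about uncoded missing values can be sketched with column input, which Section 2.7.2 describes. The column positions below are an assumption made for this illustration (the layout of the original data is not given), as is the choice of var4 as the value left blank in the second line; test2 is a made-up data set name.

```sas
/* Column ruler for the assumed layout:
1234567890123456789012
*/
data test2;
    /* Each variable is read from fixed columns, so a blank field
       becomes a missing value instead of shifting later values. */
    input id $ 1-4 var1 6-8 var2 10-11 var3 13-14 var4 16-18 var5 20-22;
    datalines;
1001 115 45 65  83  78
1002  86 27 55      86
1004  93 52 63  76  88
1015  73 35 43 112 108
;
run;

proc print data=test2;
run;
```

With this layout all four observations are kept, and in the second observation var4 is missing while var5 is read as 86, because each variable is taken from its own columns and not from the next available value.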
It is more common for real data to come in fixed column format, where the fields on each line are aligned in columns one under another.

2.7.1. LIST INPUT

- the values are separated by spaces
- missing values must be represented by full stops
- by default, character values cannot be longer than 8 characters
- character values cannot contain embedded blanks
- fields must be read in order

e.g.

data one;
    input height weight name $ age;
    datalines;
65 150 Chris 50
60 125 Kelly 35
68 180 Leslie 29
;
run;

2.7.2. COLUMN INPUT

- data must be aligned within the column positions specified
- character values can contain embedded blanks
- input values can be read in any order
- character values can be of length 1 to 32,767 characters (the limit was 200 in SAS version 6)
- leading and trailing blanks within a field are ignored

e.g.

data two;
    input name $ 1-7 age 9-10 birthdate $ 11-22 sport $ 23-30;
    datalines;
Ronald  40Dec  3 1954 golf
Michael 37Jul  4 1957 fishing
Laurel  33Jun 23 1961 softball
;
run;

2.7.3. FORMATTED INPUT

- character values can be of length 1 to 32,767 characters (the limit was 200 in SAS version 6)
- a full stop is not needed for numeric missing values
- nonstandard data, such as dates or numbers containing commas, can be read in
- with the use of pointer controls, values can be read in any order

This method of input uses pointer controls and informats for reading in nonstandard data from external data files.

An informat is used for reading in data containing dates, numbers with commas, etc.

The informat w.d after a variable specifies the width w and the number of decimal places d to be used in reading in a number.
e.g. for the number 2346 (with no decimal point in the data),
- the informat 4.2 would result in the number 23.46 being read in
- the informat 4. with no 'd' specified would result in the number 2346 being read in

The informat $w. after a variable specifies the length of a character variable.

Dates such as 21/10/89 can be read using the informat DDMMYY8.
(Note the full stop at the end of the informat.)

Pointers indicate the position of a variable, e.g.

@n   go to column n
+n   move the pointer on n positions

e.g.

/* A line of place counters is often useful to help alignment
0000000001111111111222222222233333333334444444444555555
1234567890123456789012345678901234567890123456789012345*/
data three;
    input @1 name $7. @10 age 2.0 @14 birthdate $11. @28 sport $8.
        / @9 gradyr 4.0 @16 numchild 1.0 @20 occupation $20.;
    datalines;
Ronald   40  Dec  3 1954   golf
        1973   2   masonry contractor
Michael  37  Jul  4 1957   fishing
        1975   2   bricklayer
Laurel   33  Jun 23 1961   softball
        1979   0   attorney
;
run;

/ tells the pointer to go to the next line. Once you go to the next line, you cannot move back to the previous line.

Exercises

2.1 Create a permanent data set of the data given in Exercise 1.1. How are you going to retrieve this data without having to retype it? You should be able to modify the program you have saved.

2.2 (a) Download the Minitab version of the pulse data file from the web or WebCT (pulse.mtw, not pulse.prn). Open Minitab and load pulse.mtw using File > Open Worksheet (not Open Project). Use File > Other Files > Export Special Text (not File > Save Current Worksheet As) to export the PULSE file as an ASCII file. You have to highlight the variables to export, then press Select. Press OK. Enter a suitable file name. Change the file type to ANSI Text Files (*.TXT). Finally press Save.
(b) Create a permanent SAS data set of the data.

2.3 To illustrate the dangers of list input, take the data file blood.dat and edit it with a text editor (Notepad or Word). Make one or two mistakes in it by removing some of the entries in one or more lines. Now save it as a text file and use it to input and print a SAS data set. Examine your log file and output to see what has gone wrong and how you are warned.
2.4 (a) (WINDOWS ONLY) Create an Excel file containing the following data, where column 1 is size, column 2 is colour, column 3 is price and column 4 is transport cost. Save it as a formatted text space delimited file.

Large     Red      18.97   0.25
Medium    Blue     24.68   1.10
X-Large   Black    29.99   1.75
Small     Orange   15.89   0.90

(b) Write and submit a SAS program to read in the data using list input and print the variables colour, size and price in that order.
(c) Redo (b) using column input.
(d) Redo (b) using formatted input.

2.5 Copy the text file houses.dat from the web. The file contains the following five variables for each of the 120 houses in a survey of house prices. Examine it with an editor (WordPad or Word).

VARIABLE   CONTENTS               COLUMN LOCATION
style      Type of house          1
sqfeet     Floor area             3-6
bedroom    Number of bedrooms     8
baths      Number of bathrooms    10-12
price      Price of house         14-19

Use column input to create a permanent SAS data set for the housing data and print the contents.

2.6 (a) Download the cars Excel file from the web. To create a file of raw data for reading into a SAS data set: open the file in Excel, right align the columns, and delete the coding information about the origin variable (in column L). Save the data as a formatted text space delimited file (.prn extension), or as a csv file in Linux.
(b) Use this file to create a SAS data set (use column input). To identify the column location for each variable, open the .prn file in Notepad. Move the cursor along the row of data, taking a note of the column locations.

3. SAS procedures

3.1. Structure of a SAS program

Once you have got the data organised, a simple SAS program consists of a series of procedures. You have already used three of these procedures: PROC PRINT, PROC MEANS and PROC CONTENTS. Apart from specifying which data set to use, you had no control over the type of output that SAS produced.
This may have given the impression that SAS is rather inflexible. However, this is far from the truth. Most procedures have several options which can be invoked, and in addition there are statements which can be incorporated into a program (which themselves have options). The procedures and subsequent statements determine the nature of the output produced.

Most SAS procedures use the following syntax:

PROC PROCNAME options;
    STATEMENTS / statement options;
RUN;

A program will typically consist of several such blocks of code.

3.2. Sample program

libname unit3 'c:\sas\sasdata';

proc sort data=unit3.pulse out=sorted;
    by activity;
run;

proc print data=sorted noobs N;   *NOOBS removes observation numbers;
    format height 6.0;
    title 'Pulse data from Minitab sorted by activity';
    var pulse1 pulse2 weight height;
    by activity;
run;

proc freq;
    tables ran smokes activity;
    tables sex/nocum nopercent;
    tables sex*smokes;
run;

proc means maxdec=2 mean std;
    title 'Pulse rates before and after exercise';
    var pulse1 pulse2;
run;

Task 1
Print the pulse data that you saved as a permanent SAS data set in Exercise 2.2. Now run the first two parts of the sample program (PROC SORT and PROC PRINT) and compare the output. Remember to specify an appropriate library.

One way of printing separate tables for different subgroups is to use a BY statement. In order to do this the data set must already be sorted by this BY variable. If you do not want to overwrite the original file then the sorted data must be stored in a new file. The statements:

proc sort data=unit3.pulse out=sorted;
    by activity;
run;

sort the pulse data by activity level and store the sorted data in a new file called sorted.

Task 2
Look in the libraries to see where this file is stored. Is it a permanent or a temporary data set?

The option NOOBS suppresses the observation numbers and the option N allows the sample size to be printed at the end of each table.
The format statement gives an instruction to print the values of height with a maximum of six characters and no decimal places.

Task 3
Type in the rest of the sample program and see if you can work out what the remaining statements and options are doing. Look carefully at the titles. What happens if no title statement is made in a procedure?

Individual procedures will be looked at in more detail in the next few sections. Information about the options available for individual procedures is given in the SAS help, though it is not always very easy to follow!

It is not strictly necessary to have a run statement between each procedure. SAS recognises that a new procedure statement indicates that the previous statements refer to the preceding procedure. However, it is generally advisable to include additional run statements, and it is essential to put a run statement at the end of the program.

4. Summarising data

4.1. SAS System Options

You have probably noticed that the date and a page number are included on all the output produced by SAS. The type of output produced by SAS is determined by the system but may be changed by making use of SAS System Options. There are dozens of options available which deal with hardware and software interfacing, and with the input and processing as well as the output of jobs. A list of the options may be found in help. The following are some commonly used options which may be used to change the output.

Option              Action
CENTRE/NOCENTRE     Output centred / left aligned
DATE/NODATE         Date shown / date not shown
NUMBER/NONUMBER     Pages numbered / not numbered
PAGESIZE=           Determines the number of lines per page
LINESIZE=           Determines the printer line width
FIRSTOBS=           Specifies the first observation to include from the data set
OBS=                Specifies the last observation to include. This is useful for testing code using large data sets.
OBS=MAX             Includes all observations

The following lines of code will produce a printout of observations 20 to 45 inclusive of the pulse data, with no page numbering, no date, left aligned and with 20 rows on the page.

options nonumber nodate nocentre pagesize=20 linesize=80 firstobs=20 obs=45;
libname unit4 'c:\sas\sasdata';

proc print data=unit4.pulse;
run;

options firstobs=1 obs=max;   /* Uses all observations in any analysis that follows. */

SAS system options remain in place for the whole of a SAS session unless subsequently changed. If an OPTIONS statement is entered within a DATA or PROC step then it takes effect immediately. An OPTIONS statement entered outside of a step takes effect with the following step.

4.2. HTML output

HTML output can be turned on from the menu Tools > Options > Preferences. Select the Results tab and select the Create HTML box. The dialogue windows are shown below, for both Linux and Windows. The dialogue includes an option to write the output into a specified folder. If this option is not used, the output file is written into the folder specified for the work library.

Figure 7 Dialogue to turn on HTML output (in Windows / in Linux)

4.3. Summary Procedures

Four procedures, PROC SORT, PROC MEANS, PROC UNIVARIATE and PROC FREQ, may be used to summarise data. The most commonly used options and statements for these procedures, together with sample programs, are given below. The complete set of options can be obtained in SAS help: Help > SAS Help and Documentation. Choose SAS Products, Base SAS, SAS Procedures, then Procedures. From there you should click on the procedure you require.

4.4. PROC SORT

Options             Description
DATA=               Data set to be used, uses the last data set created by default
OUT=                Specifies the name of the file to store the sorted data. If no OUT option is used the original file will be overwritten.
Statements          Description
BY <DESCENDING>     A list of variables to sort by must be specified. DESCENDING placed before a variable name will sort the data in descending order for that variable.

options centre pagesize=50 firstobs=1 obs=92;

proc sort data=unit4.pulse out=sorted;
    by sex descending ran;
run;

proc print;
run;

The lines of code sort the pulse data by sex (males first, followed by females) and within sex by whether the students ran. Those who did not run (coded 2) are placed before those that did run (coded 1) because DESCENDING has been specified. Note that no DATA= option is used with PROC PRINT. SAS automatically uses the sorted data set because that was the last data set created.

Task 1
Using the first 50 observations only of the pulse data, create a data set sorted by smoking (non-smokers first) and by activity. Print out the sorted data set. Check carefully that the output is what you expect.

4.5. PROC MEANS

Options             Description
DATA=               Data set to be used, by default will use the last data set created
MAXDEC=             Gives the maximum number of decimals to be used in the output (must be between 0 and 8)
NOPRINT             Suppresses the printing if the procedure is only being used to send summary output to a file (see OUTPUT statement).
ALPHA=              Gives the value for confidence limits (ALPHA=0.05 for 95% C.I.)
statistic keywords  By default PROC MEANS prints out the variable name, count, mean, std dev, min and max values. Particular statistics may be requested.

Statistic keywords (given as procedure options)
N                   Number of non-missing observations in a subgroup
NMISS               Number of missing observations
MEAN                Mean
STD                 Standard deviation
MIN                 Minimum value
MAX                 Maximum value
RANGE               Range
STDERR              Standard error
CLM                 Confidence limits for the mean
(For additional keywords see SAS Help)

Statements          Description
VAR                 Specify a list of numeric variables for which statistics are required.
BY                  Specify a list of alphanumeric variables (data must be sorted by these variables).
                    Descriptive statistics are given for each subgroup.
CLASS               Specify a list of alphanumeric variables. Descriptive statistics are given for each subgroup. Uses more memory than the BY statement but does not need the data to be sorted.
OUTPUT              There are various ways of storing all or some of the summary statistics requested. You need to specify a file name using OUT=filename and which variables/statistics are required. See the sample program for a simple example of how this can be done.

The following lines of code may be used to get summary statistics (the means, standard deviations and standard errors) for pulse1 and pulse2 in subgroups defined by sex and whether the students ran. These summary values are stored in a file named summary.

proc means data=unit4.pulse maxdec=2 mean std stderr;
    var pulse1 pulse2;
    class sex ran;
    output out=summary mean=mean_p1 mean_p2
                       std=std_p1 std_p2
                       stderr=se_p1 se_p2;
run;

proc print data=summary;
run;

Task 2
Submit the previous sample program. (Remember you may need to change the library name and file name of the data set.) What data has been stored in the file 'summary'? What does the _TYPE_ variable indicate in the print output?

4.6. PROC UNIVARIATE

Options             Description
DATA=               Data set to be used, uses the last data set created by default
PLOT                Produces stem-and-leaf plots, boxplots and normal probability plots of the data.
NOPRINT             Suppresses all printing.

Statements          Description
VAR                 Specify a list of numeric variables for which statistics are required.
BY                  Specify a list of character variables (data must be sorted).
OUTPUT              You need to specify a file name using OUT=filename and which statistics/variable names are required.

Task 3
Submit the following program and see how the printout differs from that produced by PROC MEANS. How does the output file containing summary values differ?
proc univariate data=sorted plot;
    var height weight;
    by sex;
    output out=summary mean=mean_ht mean_wt std=std_ht std_wt;
run;

proc print data=summary;
run;

4.7. PROC FREQ

Options             Description
DATA=               Data set to be used, uses the last data set created by default

Statements          Description
TABLES              Specify a list of alphanumeric variables for which tallies are required. Smaller subgroups may be defined by the use of an *, e.g. sex*ran*activity

TABLES statement options
NOCOL               Does not show column percentages
NOCUM               Does not show cumulative frequencies or percentages
NOFREQ              Does not show cell frequencies
NOPERCENT           Does not show cell percentages
NOROW               Does not show row percentages
CHISQ               Gives results of chi-squared tests of independence

The following code produces frequency tables for sex and smoking habit separately and a two-way table of sex and smoking habit. The output also includes the results of a chi-squared test of independence for these two variables.

proc freq data=unit4.pulse;
    tables sex smokes smokes*sex/nocol norow nocum chisq;
run;

4.7.1. Chi-square test

PROC FREQ is used to carry out a chi-square test for the association of two categorical variables. In this case the null hypothesis is that there is no association between smoking and sex: the same proportion of smokers should be found amongst males and females.

It is convenient to add the row percentage to the cross tabulation as an easy way to look for a possible association. This is achieved by removing the option "norow". The options "nocol" and "nopercent" have been left in the statement to remove clutter from the output.
proc freq data=unit4.pulse;
    tables smokes*sex / chisq nocol nopercent;
run;

The output from SAS gives

The FREQ Procedure

Table of Smokes by Sex

Smokes     Sex

Frequency|
Row Pct  |1       |2       |  Total
---------+--------+--------+
1        |     20 |      8 |     28
         |  71.43 |  28.57 |
---------+--------+--------+
2        |     37 |     27 |     64
         |  57.81 |  42.19 |
---------+--------+--------+
Total          57       35       92

Statistics for Table of Smokes by Sex

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     1      1.5321    0.2158
Likelihood Ratio Chi-Square    1      1.5699    0.2102
Continuity Adj. Chi-Square     1      1.0089    0.3152
Mantel-Haenszel Chi-Square     1      1.5154    0.2183
Phi Coefficient                       0.1290
Contingency Coefficient               0.1280
Cramer's V                            0.1290

If there were no association, the probability of obtaining a chi-square statistic at least as large as 1.5321 by chance alone would be 0.2158. There is therefore no significant evidence of an association between sex and smoking in this sample.

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        20
Left-sided Pr <= F          0.9310
Right-sided Pr >= F         0.1576

Table Probability (P)       0.0886
Two-sided Pr <= P           0.2502

Sample Size = 92

Sometimes your data may simply be the counts of each table cell. The data set must then contain a variable, called something like "count", which contains the number of observations in each cell. In this case use the weight statement in the procedure, e.g.

weight count;

4.8. General syntax for a procedure

A general example of a procedure is given below. Each procedure uses only a certain combination of statements, but the action of each statement is common across the procedures in which it can be used.

PROC PROCNAME DATA = lib1.data1 OUT = lib2.data2 noprint;
WEIGHT WeightVar ;               * Specifies which variable weights the analysis;
FORMAT NumVar 8.0 CatVar $3. ;   * Specifies formats of variables;
BY CatVar ;                      * Gives 1 output per group, SORT needed;
CLASS CatVar ;                   * Similar to BY statement, no SORTing needed.
                                   Makes a numeric variable act as a categorical variable;
VAR NumVar ;                     * Restricts analysis to named variables;
OUTPUT OUT = Summary keyword = DescriptiveStatistic ;
                                 * Named output data set, specifies names;
WHERE NumVar2 > 1000 ;           * Only uses certain cases;
TABLE CatVar * NumVar ;          * Specifies an output table;
FREQ NumVar3 ;                   * Variable giving the observation frequency;
MODEL YVar = XVar ... ;          * Fits models;
PLOT YVar * XVar ;               * Plots a scatter plot;
RUN;

4.9. Help

Extensive documentation on each procedure can be found by using Help > SAS Help and Documentation. An example of the help screen for the BASE SAS procedures is shown in Figure 8. Other useful help modules are SAS/STAT and SAS/GRAPH.

Figure 8 Base SAS Procedures help

Exercises

4.1 (a) For the pulse data, get a printout which shows the number of observations, the mean, standard deviation, maximum and minimum values of pulse2 in each of the four subgroups defined by whether the student smoked/did not smoke and ran/did not run. Make the printout left-aligned with the summary values shown to one decimal place.
(b) Obtain confidence limits for the four means produced in part (a). Does it appear that smoking or running on the spot had any effect on the second pulse rate?
(c) Obtain comparative boxplots of the second pulse rate in each of the four subgroups.
(d) What percentage of smokers were made to run on the spot? Get suitable SAS output to give you this information.

4.2 For the SAS data set RETAIL (from the Explorer tab look in the library SASHELP) get a printout which shows the number of observations and the mean and standard deviation of retail sales in each year. Print the mean and standard deviation to two decimal places. Output the mean sales for each year to a new file called 'summary' and get a printout of this file.

5. Graphs and charts

5.1. Graphics procedures

SAS can produce two types of graphics, high or low resolution. High resolution graphics are sent to a special graphics window where the graphs can be edited and copied into Word documents. GPLOT and GCHART are two procedures which produce high resolution graphics; the equivalent low resolution procedures are PLOT and CHART.

5.2. PROC PLOT

Options             Description
DATA=               Data set to be used, uses the last data set created by default

Statements          Description
PLOT                Specify yvariable*xvariable. Can produce several plots with a single statement by including a list of variables in parentheses, e.g. (list of n yvariables)*(list of m xvariables) will produce n x m separate plots.
BY                  Specify a list of character variables (data must be sorted) to produce separate graphs for subgroups.

PLOT options
='symbol'           Specify a symbol to be used for plotting
=variable           Identifies each point by the value of another variable

The following code produces a plot of weight against height for all students, separate plots of weight against height for each sex, and a single plot with a different symbol for males and females.

proc sort data=unit5.pulse out=sorted;
    by sex;
proc plot data=sorted;
    plot weight*height;
proc plot;
    plot weight*height='*';
    by sex;
run;
proc plot;
    plot weight*height=sex;
run;

Task 1
Input the program into SAS and examine the output. (Remember to assign a LIBNAME as the first statement.) Resubmit the program using PROC GPLOT instead of PROC PLOT. What differences in the output do you observe?

When it has run successfully in PROC GPLOT, you will find that you are in a graph window. The graph can be edited in SAS by clicking on the painting icon. To come out of editing the plot, click on File and then down to End. You can save the graph to a file or cut and paste it into Word, where it can be further edited if required. The graph window must be closed down (using its close button) before another SAS program can be run.

5.3. PROC CHART

Options             Description
DATA=               Data set to be used, uses the last data set created by default

Statements          Description
HBAR                Specify variable to produce a frequency bar chart (horizontal bars)
VBAR                Specify variable to produce a frequency bar chart (vertical bars)
PIE                 Specify variable to produce a pie chart.
BLOCK               Specify variable to use on the x-axis. Used in conjunction with GROUP and SUMVAR options to produce three-dimensional bar charts.
BY                  Specify a list of character variables (data must be sorted) to produce separate charts for subgroups.

HBAR/VBAR/PIE options
SUMVAR=             Specify an analysis variable, the sum of which is to be shown on the y-axis
TYPE=               May be used on its own or in conjunction with SUMVAR to produce statistics other than the frequency or sum on the y-axis. The options for TYPE are FREQ (frequency counts), PCT (percentages), CFREQ (cumulative frequencies), CPCT (cumulative percentages), SUM (totals), MEAN (means). The default is TYPE=SUM if SUMVAR is used, otherwise the default is TYPE=FREQ.
LEVELS=             Specifies the number of equal width classes for numeric variables.
MIDPOINTS=          Specifies the midpoints of classes for numeric variables: MIDPOINTS=lower_limit TO upper_limit BY interval
DISCRETE            Prevents SAS from dividing a discrete variable into inappropriate intervals, e.g. ensures a variable coded from 1 to 5 will produce 5 classes.
GROUP=              Produces separate bar charts on the same graph for different discrete values of the GROUP variable.

The following program illustrates some of the features of PROC CHART.

proc chart data=sorted;
    hbar smokes;
    by sex;
run;

proc chart;
    vbar height/levels=6 group=sex sumvar=pulse2 type=mean;
run;

Task 2
Run this program and look carefully at the output obtained. Adapt the code to show other statistics on the y-axis of the vertical bar chart, a different number of bars and so on, to familiarise yourself with the procedure.

Exercises

5.1 Plot weight against height for the data from Exercise 1.1.
This should be stored somewhere as a permanent SAS data set. Use a plus sign as your plotting symbol.

5.2 For the pulse data:
(i) Plot the second pulse rate against weight, using a different symbol for males and females.
(ii) Produce a pie chart which shows the percentage of students who usually have particular levels of activity.
(iii) Produce a horizontal bar chart which shows the mean of the second pulse rate for those students that did and did not run.
(iv) Produce vertical bar charts side by side which show the percentage of students who have different levels of physical activity for males and females separately.
(v) Produce a three-dimensional bar chart which shows the mean of the first pulse rate for subgroups defined by level of activity and smoking habit.

6. Correlation and Regression

Two procedures may be used to obtain information about the relationship between two or more continuous variables. PROC CORR determines the correlation coefficients between selected variables, and PROC REG fits a regression model to data and allows output to be saved for further analyses. Remember that the statements and options given in these notes are only a very small subset of those that can be used with particular procedures. The help facility may be used to investigate further possibilities.

6.1. PROC CORR

Options             Description
DATA=               Data set to be used, uses the last data set created by default
SPEARMAN            Calculates the Spearman rank correlation coefficient. By default the Pearson product moment correlation is calculated.

Statements          Description
VAR                 Specify a variable list (essential)
WITH                Specify a variable list to be used with VAR. The VAR variables are given at the top of the table of correlations and the WITH variables at the side. If WITH is not used, a matrix of the correlations between all pairs of variables is produced.
BY                  Specify a list of character variables (data must be sorted) to produce separate tables of correlations.
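As a small sketch of how the SPEARMAN option and the WITH statement listed above fit together, the step below uses SASHELP.CLASS, a sample data set supplied with SAS. It is used here purely as a stand-in and is not one of the data sets used in these notes.

```sas
/* SASHELP.CLASS is a stand-in data set chosen for this sketch. */
proc corr data=sashelp.class spearman;
    var height weight;   /* VAR variables appear across the top of the table */
    with age;            /* WITH variables appear down the side */
run;
```

Dropping the WITH statement and listing all three variables in VAR would give the full matrix of correlations between every pair instead.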
An example of the use of PROC CORR is shown below using the data concerned with heights and weights of eleven-year-olds (Exercise 1.1).

proc corr data=unit6.eleven;
    var height weight;
run;

proc corr data=unit6.eleven;
    var height;
    with weight;
run;

Task 1
Try running this program and look at the difference in output produced when the WITH statement is included or not included. What do the values under the correlation coefficients indicate?

6.2. PROC REG

Options             Description
DATA=               Data set to be used, uses the last data set created by default
CORR                Prints a correlation matrix for all variables listed in the MODEL statement.

Statements          Description
MODEL               For simple linear regression, specify a response variable and an independent variable: response=independent variable
OUTPUT              You need to specify an output file using OUT=filename and a list of keywords (statistics)/names required. Commonly used keywords are p (predicted values), r (residuals), student (standardised residuals). See the sample program for syntax.
PLOT                Allows scatter plots to be produced using any variables in the model or keywords in the OUTPUT statement. Note that the keywords in the OUTPUT statement must be followed by a full stop when used as variables in the PLOT statement.
BY                  Specify a list of character variables (data must be sorted) to produce a separate regression analysis for each subgroup.

PLOT options
OVERLAY             Superimposes several scatter plots on the same graph.

The following code produces output which shows:
- the correlation matrix of weight and height
- output from a simple linear regression analysis of weight on height
- a plot of weight against height with the predicted values (line) superimposed
- a plot of the standardised residuals against height
- a printout of the output file
- a normal probability plot for each of the residuals and standardised residuals, together with the results of a test of normality.
proc reg data=unit6.eleven corr;
    model weight=height;
    output out=regout p=pred r=resid student=stdresid;
    plot weight*height='*' p.*height='P'/overlay;
    plot student.*height;
run;

proc print data=regout;
run;

proc univariate plot normal data=regout;
    var resid stdresid;
run;

Task 2
Obtain the output from the previous program. Why is it important to produce residual plots and normal probability plots for the residuals from a linear regression model?

Exercises

6.1 Data set beetle, available as a permanent SAS data set on the web page, gives information on a sample of beetles and the damage they cause.
(a) Download the file and examine the data.
(b) (i) Plot area against male beetle length.
    (ii) Regress area against male beetle length.
    (iii) Obtain a plot of the standardised residuals against length.
    (iv) Obtain a normal probability plot of the standardised residuals.
(c) Why isn't a simple linear regression model satisfactory? What might be a useful thing to try in order to improve the model?
(d) Similarly, investigate the relationship between the amount of frass produced and female beetle length.

7. Exploratory data analysis

7.1. SAS/INSIGHT

SAS/INSIGHT software is a tool for data exploration and analysis. It is interactive, which means you can explore data through graphs and analyses linked across multiple windows. This facility can be used to identify outliers, highlight subgroups on a graph and so on. SAS/INSIGHT allows you to analyse univariate distributions, investigate multivariate distributions and fit explanatory models, such as a simple linear regression model, to your data.

7.2. Accessing SAS/INSIGHT

SAS/INSIGHT may be used with any SAS data sets that have been created previously. If you wish to investigate a permanent SAS data set, make sure that you have set up an appropriate library using a LIBNAME statement before invoking SAS/INSIGHT.
It is also possible to enter data directly into the SAS/INSIGHT data window. Within SAS, select Solutions > Analyse > Interactive data analysis. The data of interest can then be accessed using Library > Data Set > Open, or a new data set can be created in the data window using New.

7.3. Creating a Scatter Plot
To investigate SAS/INSIGHT initially, choose a data set you are fairly familiar with and try out some of the features. This example uses the Pulse data set.

Analyse > Interactive Analysis
Figure 9. Opening the SAS/INSIGHT dialogue in Windows
Figure 10. Opening the SAS/INSIGHT dialogue in Linux

You can create charts using the Analyse menu. An example of creating a scatter plot is shown in Figure 11.

Analyse > Scatter Plot (X/Y)
Figure 11. The Scatter Plot (X/Y) dialogue
Figure 12. Two SAS/INSIGHT scatter plots with 3 points highlighted

A key feature of SAS/INSIGHT graphs is that they are interactive. Click on a point on one chart and all corresponding points become highlighted. An example of this interactivity is shown in Figure 12. The 3 points labelled with '1' use a larger symbol and indicate the level of the variable ran (1 = yes). To select several points, hold down the Ctrl key while clicking. To highlight an area of points, left click and hold in one corner of the region, then drag the cursor to the opposite corner. All points in the rectangle will be highlighted. You can then deselect the points by clicking on a point outside the rectangle.

7.4. Features of SAS/INSIGHT
SAS/INSIGHT software provides an extensive range of tools for investigating data and carrying out analyses. Some of the activities that you can carry out using SAS/INSIGHT are shown below.
• enter data from the keyboard
• identify observations in plots
• examine all values for selected observations
• brush observations in graphs
• create overlaid line plots
• rotate data in three dimensional plots
• manipulate histograms to explore the distribution of data
• compare distributions in box plots and mosaic plots
• compute descriptive statistics
• fit parametric (normal, lognormal, exponential, Weibull) and kernel density estimates
• fit parametric cumulative distribution functions
• create quantile-quantile plots
• calculate correlations and principal components to find the structure of your data
• fit a general linear model
• create residual and leverage plots
• transform variables
• process data by groups for every analysis

7.5. Using SAS/INSIGHT
Once you have a data set in SAS/INSIGHT, manipulations and analyses are carried out by using either Edit or Analyse on the main menu. Operations are also available from pop-up menus by:
• clicking the left mouse button in the corners of graphs and tables
• pressing the right mouse button over an appropriate object.

Variables to be analysed may either be selected before clicking on Analyse or entered as requested within the particular analysis window. In the data set window:
• a variable is selected by clicking on the name
• several variables can be selected by holding down the left mouse button and dragging across the selection
• non-contiguous variables or observations may be selected by holding down the Ctrl key and clicking on individual names or row numbers.

In Windows, any plots produced can be printed directly or copied and pasted into Word documents. Tabular output can be saved into the normal SAS Output Window using File > Save > Tables.

Commands from your SAS/INSIGHT session can be recorded and later resubmitted. The FILE and INFILE options allow you to produce a file containing commands to document and reproduce a SAS/INSIGHT session.
This is very useful for exploratory analysis that you need to interrupt or repeat on different sets of data.

Example

    filename note 'h:\MA71064 Statistical Computing\insight.txt';
    proc insight file = note;
    run;

After doing your analysis and then exiting, your file will contain the commands that were used to create and close your Insight session. You can begin your second Insight session from where you left your first session with the following code:

    filename note 'h:\MA71064 Statistical Computing\insight.txt';
    proc insight infile = note;
    run;

Alternatively, code the FILE keyword without specifying a filename and your commands will be recorded in the SAS Log window.

Task 1
To get a feel for what SAS/INSIGHT can do, work through the following exercises which are based on the pulse data.
(a) Obtain a histogram of height. Obtain a histogram of sex in a new window. Click on the bar representing males on the sex histogram. Look at the histogram of height. What has changed on this histogram?
(b) Obtain comparative boxplots of weight for each sex. (Input weight as the Y-variable and sex as the X-variable.) Click on the outlier for male weights. Which observation number is this? Double click on this outlier. What information do you get?
(c) Highlight observation numbers 1, 31 and 67 in the data window. Press the right mouse button and click 'Label in plots'. Obtain a scatter plot using pulse2 as the Y-variable, pulse1 as the X-variable, sex as the group variable and ran as the label. What information is shown on these plots? Double click on one of the points. What information do you get?
(d) Highlight the variable names for pulse1, pulse2, height and weight in the data window. Obtain a scatter plot. What plots do you get? What are the values shown in each plot?
(e) Using the Fit option in the Analyse menu, input pulse2 as the Y-variable, weight as the X-variable and ran as the group variable.
Look carefully at the output obtained. Redo the analysis to show Residual Normal QQ plots and store predicted values and standardised residuals in the data sheet.

7.6. Tools
Edit > Windows > Tools turns on a menu that allows data points to be coloured, or given different symbols, according to data value.

Figure 13. The tools dialogue labelling observations by sex

Press a coloured square in the tools window to set a particular set of points to that colour, e.g. colour red all those points representing people who smoke. Similarly, press one of the symbol buttons to select a given symbol for a set of observations. Other features of SAS/INSIGHT can be investigated using the help facility.

Task 2
(a) Find out what is meant by 'brushing observations'.
(b) Produce summary statistics and graphs for each of the continuous variables in the pulse data.
(c) Input, into the data sheet, the following data which are chloride content (mg/l) of waters draining from a particular type of rock.

    6.0 0.4 0.2 5.0 6.0 0.8 0.5 1.2 0.2 0.5 0.2 1.7 0.6 0.7 0.5 10.0 0.3 6.0

    (i) Produce a boxplot of the data and comment on the distribution.
    (ii) Create a new variable which is the log of the chloride content.
    (iii) Check whether the log values could reasonably be assumed to have come from a normal distribution.

8. Modifying data and output

8.1. Introduction
In the previous units you have discovered how to create permanent SAS data sets and produce output using a variety of procedures. The OPTIONS statement, introduced in the unit on Describing Data, allows you to make some changes to the output produced. Generally speaking, though, there has been very little flexibility in either changing the style of the output or modifying the data set that has been used. In this unit you will learn how to make changes to a permanent SAS data set and to customise some types of output.

8.2.
SET statement
The SET statement is a very versatile statement which is used in a DATA step and enables a variety of tasks to be carried out depending on which options are used. One of its most common uses is reading observations and variables from existing SAS data sets so that further processing can take place. Another use is combining two or more data sets so that analyses can be carried out on a larger set of variables or observations.

The same operations can often be done in different ways because some SAS statements can be incorporated as options into either the DATA or SET statements. Have a look at the following two pieces of code, which achieve the same thing.

    data unit8.beetles1 (drop=site);
      set unit8.beetles;
      where site='1';
    run;

    data unit8.beetles1 (drop=site);
      set unit8.beetles (where=(site='1'));
    run;

Both pieces of code produce a new permanent SAS data set called beetles1 which has data from site one only and does not include site as a variable. The number 1 is shown in single quotes because site is a character variable in this data set.

The DROP option allows variables that are not required to be omitted from the new data set. It can be included in either the DATA or SET statement but you have to be a bit careful. If DROP is used in the SET statement then the variables involved cannot be used for further processing. In this example, using the DROP option with the variable site in the SET statement would result in an error message. Try it!

The following table shows some of the commonly used options in the DATA and SET statements.
Data Set Options  Description
DROP=             Specify one or more variables to exclude either from further processing or from the new data set
FIRSTOBS=         Specify the first observation required for processing
KEEP=             Specify one or more variables to include in further processing or in the new data set
LABEL=            Specify a label for the data set (labels for variables are given with the LABEL statement; see section Labelling output)
OBS=              Specify the last observation required for processing
RENAME=           Specify new names for variables
WHERE=            Specify a condition to select certain observations from a SAS data set

8.3. DROP and KEEP
The following code shows how the DROP and KEEP options may be used in a program.

    data lengths (keep=height cond mlength flength)
         damage (drop=mlength flength);
      set unit8.beetles (drop=site);
    run;

    proc print data=lengths;
    proc print data=damage;
    run;

Note that more than one data set can be specified in a single DATA statement. In this example two temporary data sets are produced, lengths and damage.

Task 1
Look at the preceding code and see if you can work out which variables will be contained in each data set. Submit the code and see if you are correct. If you wanted to create permanent data sets, what changes to the code would you have to make?

8.4. Labelling output
Variable names in SAS can be up to 32 characters long (before SAS version 7 the limit was eight characters), and by default these variable names are used as column headings. A LABEL statement may be used to associate a descriptive label with a variable. If the labels are required in the output then either a LABEL or SPLIT option must be used in the PROC PRINT statement. If a LABEL statement is placed in a DATA step then the labels are permanently associated with the variables in that SAS data set. An example of the use of labels follows.
    proc sort data=unit8.beetles1 out=sorted;
      by cond height;

    proc print data=sorted split='*';
      var height mlength area frass;
      by cond;
      label mlength='male*length'
            area='leaf area*consumed'
            frass='number of*frass*pellets';
      pageby cond;
      sum frass;
    run;

Task 2
Submit the code in SAS and inspect the output. What effect have the PAGEBY and SUM statements had?

If you had used label names without the asterisks and used the LABEL option in the PROC PRINT statement then SAS would split the label names automatically at a suitable place, but you would have no control over the process.

8.5. PROC PRINT

Options     Description
DATA=       Data set to be used; the last data set created is used by default
LABEL       Ensures that column labels are used in the output
SPLIT=      Specify a character in the label names which splits the column headings onto two or more lines
NOOBS       Suppresses the printing of observation numbers in the output

Statements  Description
VAR         Specify variables to be printed
ID          Specify variable to use as identification instead of the observation number
BY          Specify a list of character variables (data must be sorted) to produce separate tables
PAGEBY      Used with the BY statement to output each table on a separate page
SUMBY       Prints subtotals for the specified BY variable
SUM         Specify numeric variables for which the sum of the values is required

8.6. PROC FORMAT
The LABEL statement allows you to give longer names to variables so that any output is easier to interpret. It is also possible to assign names to individual categories for character variables and to save these names as permanent formats. This is done using PROC FORMAT. The permanent formats are saved in a location which is specified using the LIBNAME statement with the special libref name LIBRARY. For example,

    libname library 'C:\sasdata';

will store the formats in a directory sasdata on the hard disk when the LIBRARY= option is used in the FORMAT procedure.
Task 3
Assign a library called LIBRARY in a suitable location. (The location where you have stored your SAS permanent data sets is probably the most appropriate.)

The individual names are assigned using a VALUE statement. These formats are independent of any particular data set and, if appropriate, may be used with any variable. The following example shows suitable labels for the plant height and plant condition categories from the beetles data, but the format $health, for example, could be used for any variable where 1, 2 and 3 represent poor, satisfactory and good respectively.

    proc format library=library;
      value $height '1'='less than 10cm'
                    '2'='10cm<20cm'
                    '3'='20cm<30cm'
                    '4'='30cm<40cm'
                    '5'='40cm or more';
      value $health '1'='poor'
                    '2'='satisfactory'
                    '3'='good';
    run;

When these labels are required in output, the format names must be followed by a full stop in the FORMAT statement.

    proc print data=unit8.beetles;
      format height $height. cond $health.;
    run;

Task 4
Create the above formats and print out the beetles data set to see the effect of using these formats. Check the library called LIBRARY. You should find a catalog called FORMATS. Double-clicking on FORMATS will give a list of all permanent formats you have created.

The permanent formats which have been created may be used at any time. If you want to use them in a future session remember to use a LIBNAME statement initially to assign the libref LIBRARY. (To use more than one permanent format library use options fmtsearch; see HELP for details.)
_____________________________________________________________
8.7. Recoding data
When you are presenting results, or carrying out an analysis of a set of data, you may wish to code a continuous variable such as height into discrete categories, for example short, medium and tall. In some circumstances you may wish to combine categories for an analysis.
This type of operation can be done either by using conditional statements or by making use of the VALUE statement in PROC FORMAT.
_____________________________________________________________
8.8. Conditional statements
Conditional statements can take several forms (see the SAS Help). Two commonly used in recoding are:

    if expression then statement;

    if expression then statement;
    else statement;

The following code shows the use of the IF-THEN/ELSE statement for recoding pulse1 and pulse2 into four numeric categories.

    data unit8.pulse2 (keep=pulse1 pulse2 ran);
      set unit8.pulse;
      if 40 <= pulse1 < 60 then pulse1=1;
      else if 60 <= pulse1 < 80 then pulse1=2;
      else if 80 <= pulse1 < 100 then pulse1=3;
      else pulse1=4;
      if 40 <= pulse2 < 60 then pulse2=1;
      else if 60 <= pulse2 < 80 then pulse2=2;
      else if 80 <= pulse2 < 100 then pulse2=3;
      else pulse2=4;
    run;

    proc freq data=unit8.pulse2;
      tables pulse1*pulse2 / nocol nocum nopercent norow;
      by ran;
    run;
_____________________________________________________________
8.9. VALUE statement
An alternative way to code data is to create new formats for the required variables and use these formats when they are needed to produce particular output. For example, putting the pulse rates into categories and creating a two way table could be done by creating a new format as follows.

    libname library 'c:\sasdata';

    proc format library=library;
      value pulse 40-59=1
                  60-79=2
                  80-99=3
                  100-200=4;
    run;

    proc freq data=unit8.pulse;
      tables pulse1*pulse2 / nocol norow nocum nopercent chisq;
      by ran;
      format pulse1 pulse2 pulse.;
    run;

Task 5
Try out these alternative ways of recoding the data. What do you think are the advantages and disadvantages of each method? What do the resulting tables tell you about the relationship between the first and second pulse rates in each group?
The values that may be assigned to a particular code or description may be specified in the following ways.

Range specification in the VALUE statement   Description
value                                        a single value
value1-valuen                                a range of values
value1, value2, ...                          a list of values
HIGH                                         the highest possible value
LOW                                          the lowest possible value
OTHER                                        anything that does not fall into any range

For example, a format for age groups could be created as follows.

    libname library 'c:\sasdata\';

    proc format library=library;
      value agegroup low-24='under 25'
                     25-49='25 or more but less than 50'
                     50-high='50 or over';
    run;
____________________________________________________________________
8.10. OUTPUT
The OUTPUT statement is used in conjunction with the SET statement to create multiple SAS data sets. The IF statement is used with the OUTPUT statement to control which observations are output to which SAS data sets, e.g.

    data american japan british;
      set mydata.cars;
      if origin = 1 then output american;
      else if origin = 2 then output japan;
      else if origin = 3 then output british;
    run;

Exercises
8.1 Create a new temporary data set, using the pulse data, named mpulse which contains data for males only. Omit the variables sex, height and weight in this set. Label pulse1 'First pulse rate', pulse2 'Second pulse rate' and print out the data set using these variable labels.

8.2 Create formats for the alphanumeric variables in the pulse data to give the following information:

    ran        1=ran in place       2=did not run in place
    smokes     1=smokes regularly   2=does not smoke regularly
    sex        1=male               2=female
    activity   1=slight             2=moderate   3=a lot

Construct two way tables which show frequencies only for the following pairs of variables, using the formats you have created.
    (i) sex and smokes
    (ii) smokes and activity
    (iii) sex and ran

8.3 In Task 5 there was a warning that the chi-squared test may not be valid because of small expected numbers. Create a new format for the pulse rates which has two categories only. (Choose the categories so that there are roughly equal numbers in each category for this particular set of data.) Repeat the chi-squared tests for independence for each of the two groups (ran/did not run) as in Task 5 and comment on the results.

9. PROC TABULATE
PROC TABULATE displays descriptive statistics in tabular format. It computes many of the same statistics that are computed by other descriptive statistical procedures such as MEANS, FREQ and SUMMARY, but incorporates more flexibility.

Statements  Description
CLASS       Identifies the categories on which calculations are carried out. Class variables:
            • are either character or discrete numeric, e.g. a department code
            • supply the values used in the structure of the table
            • must be present in PROC TABULATE statements
VAR         Analysis variables:
            • contain values appropriate for calculating statistics
            • are continuous numeric
            • supply the values in the table cells
            • are optional in PROC TABULATE statements
TABLE       element1 * element2, where the elements are class variables and, optionally, the ALL variable and/or various statistics
            Table operators:
              comma    - produces a multidimensional table
              asterisk - produces a hierarchical table
              blank    - concatenates tables
BY          Specify a list of character variables (data must be sorted)
FORMAT      Specify variables with the formats wanted
FREQ        Specify the variable giving the frequency of each observation
KEYLABEL    Used to label the statistics available, e.g. keylabel all='Grand Total';
LABEL       Specify a label for a variable
WEIGHT      Specify a variable to be used for weighting the entries in the table

PROC TABULATE is the only procedure that has a SAS manual of its own. It is worth understanding the format of the table statement that controls the position of the variables.
There are 3 operators that determine where the variables are positioned in the output. Notice that the variables must be categorical. If the variables are numeric then the CLASS statement is used to tell the SAS system to treat the numeric variables as if they were categorical.

9.1. The comma table operator
Determines page, row and column positions, i.e. cross tabulation.

    Table <page variable>, <Row variable>, <Column variable> ;

e.g.

    proc tabulate data = students;
      class faculty sex;
      table faculty, sex;
    run;

                      SEX
                 Male    Female
                  N        N
    FACULTY
      Science    450      360
      Arts       350      550

9.2. The asterisk table operator
Nests two variables in the column or row.

    Table <Column variable1> * <Column variable2> ;

e.g.

    proc tabulate data = students;
      class faculty sex;
      table faculty*sex;
    run;

                   FACULTY
         Science             Arts
           SEX                SEX
       Male   Female     Male   Female
        N       N         N       N
       450     360       350     550

9.3. Using a blank table operator
Places variables side by side.

    Table <Row variable>, <Column variable1> <Column variable2> ;

e.g.

    proc tabulate data = students;
      class faculty sex;
      table faculty sex;
    run;

         FACULTY              SEX
      Science   Arts     Male   Female
         N       N        N       N
        810     900      800     910

9.4. Using the ALL variable
e.g.

    proc tabulate data = students;
      class faculty sex;
      table faculty all, sex all;
    run;

                      SEX
                 Male    Female    ALL
                  N        N        N
    FACULTY
      Science    450      360      810
      Arts       350      550      900
    ALL          800      910     1710

9.5. Other Statistics
The statistics that can be requested in PROC TABULATE include the following:

    N       number of nonmissing observations
    SUM     sum of the VAR variable for each class of the CLASS variables
    NMISS   number of missing observations
    MEAN    arithmetic mean
    STD     standard deviation
    MIN     minimum value
    MAX     maximum value

e.g.
    proc tabulate data = students;
      class faculty sex;
      var exammark;
      table faculty*sex*exammark*max;
    run;

                       FACULTY
          Science                   Arts
            SEX                      SEX
       Male      Female         Male      Female
     EXAMMARK   EXAMMARK      EXAMMARK   EXAMMARK
       MAX        MAX           MAX        MAX
        83         85            79         76
_____________________________________________________________
The operators can be mixed to produce any output you require. For example, it is possible to nest 2 variables on the rows and columns of a table.

    Table <Row var1>*<Row var2>, <Column var1>*<Column var2> ;

Exercises
9.1 Use PROC TABULATE on the pulse data to produce
(a) the average of pulse1 in a table of activity by sex
(b) the average difference in pulse in a table showing whether they smoke by whether they ran or not
_____________________________________________________________
10. Functions and formats
Functions are a useful means of writing SAS code because they simplify the coding involved and often result in you having to write fewer lines of code. Over 120 functions are available within the SAS system. Some functions operate on numeric values, others on character values. Some are specialised and operate on specific types of values such as dates and times. All functions operate on arguments, which may be variable names or specific values.
_____________________________________________________________
10.1. MEAN function
Example

    data myinfo;
      set info;
      m_val = mean (var1, var2, var3);
    run;

If the info data set consisted of the following data,

    var1   var2   var3
    2.5    5      1.5
    6.0    3      3.0

then myinfo would be as follows:

    var1   var2   var3   m_val
    2.5    5      1.5    3
    6.0    3      3.0    4

The MEAN function calculates the mean of the three variables listed. Alternatively we could have written the expression as

    m_val = mean (of var1 - var3);

An important difference between the MEAN function and the expression

    m_val = (var1 + var2 + var3)/3;

is that the MEAN function returns the mean of the nonmissing values.
So if we had a missing value for var2, the function would return the mean of var1 and var3, whereas the expression above would return a missing value if any of the var values were missing.

10.2. NMISS function
The NMISS function returns the number of missing values in a list of variables. This can be useful if, for example, we want to exclude observations from a calculation where there are too many missing values. Suppose we have recorded 50 readings for each instrument and want to compute the mean of these 50 readings, but only for those instruments with at least 30 readings (that is, with no more than 20 missing values):

    data myavges;
      input x1 - x50;
      if nmiss (of x1 - x50) le 20 then ave = mean (of x1 - x50);
    cards;
    etc.
_____________________________________________________________
10.3. N function
The N function operates in a similar fashion, but returns the number of nonmissing values.
_____________________________________________________________
10.4. String Handling Functions
The above functions are for numeric variables and do not work with strings of characters. Some of the most commonly used character functions are

    SUBSTR (char_variable, starting_position, length)   which extracts a substring
    INDEX (char_variable, index_string)                 which returns the position of a substring
    VERIFY (char_variable, verify_string)               which returns the position of the first character in char_variable that is not present in the verify string

Example

    data dept;
      set mydir.jobs;
      tot = substr ('ABCDEFG', 3, 2);
      dept = substr (account, 4, 3);
      ind = index (account, 'tch');
      ver = verify (account, 'sth');
    run;

If the data set mydir.jobs consisted of

    account   code
    spsmrk    m003
    spsmrk    m005
    spstch    t003

then the data set dept would consist of

    account   code   tot   dept   ind   ver
    spsmrk    m003   CD    mrk    0     2
    spsmrk    m005   CD    mrk    0     2
    spstch    t003   CD    tch    4     2

The substr function operates on character values to extract part of a variable value.
The structure of the substr function is substr(argument, position, length). The argument may be a character value or a variable name, the position gives the position from which to start reading, and the length gives the number of characters to read.

Task
Investigate what functions are available in SAS by selecting Help > SAS Help and Documentation, then choose SAS Products, Base SAS, SAS Language Dictionary, Dictionary of Language Elements and Functions and CALL Routines.

10.5. Date and Time Formats
The SAS System processes calendar date values by converting dates to integers representing the number of days between January 1 1960 and a specified date. For example, the following calendar date values represent the date July 26 1989:

    072689   07/26/89   26JUL89   890726   26JUL1989   26 Jul 1989

The SAS date value representing July 26, 1989 is 10799. The trick is to convert dates to numerics and back again.

SAS has many date, time and datetime informats and formats. We read the data in with date/time informats and get them back out of SAS using date/time formats. Many of the date/time informats are more or less the inverse of formats of the same name. The above dates would be read in using the following informats in the INPUT statement, e.g.

    data test;
      input var1 MMDDYY6. +1 var2 DATE7. +1 var3 MMDDYY8. +1 var4 DATE9.;
    cards;
    072689 26JUL89 07/26/89 26JUL1989
    ;
    run;

To print them all out using different formats we reverse the process, e.g.

    proc print;
      format var1 DATE9. var2 MMDDYY6. var3 DATE7. var4 YYMMDD6.;
    run;

The dot at the end of the informat (or format) indicates that it is an informat (or format) name and not a variable. Details of the different formats and informats available in SAS can be found in the SAS System help.

Blanks and other special characters can be placed between day, month, and year values. Width values must allow space for blanks and special characters.
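The round trip between a calendar date and its SAS date value can be checked with a short DATA step that creates no data set. This is only a sketch: the variable name d is arbitrary, and the INPUT function is used here in place of an INPUT statement so that no data lines are needed.

```sas
data _null_;
  /* Read the text 26JUL1989 with the DATE9. informat. */
  d = input('26JUL1989', date9.);
  put d=;             /* writes d=10799 to the log       */
  put d= mmddyy10.;   /* writes d=07/26/1989 to the log  */
  put d= date9.;      /* writes d=26JUL1989 to the log   */
run;
```

The stored value is the same number in all three PUT statements; only the format used to display it changes.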
Note: SAS defaults to a date in the 1900s if yy is two digits. Use the YEARCUTOFF= system option to override the system default and specify a date range of your choice.

Example

    Data Lines      SAS Statement         Results
    1jan1990        input day date10.;    10958
    01 jan 90                             10958
    1 jan 90                              10958
    1-jan-1990                            10958

The TIMEw. informat reads time values in the form hh:mm:ss.ss, where hh and mm are integers representing the hour and minute, and ss.ss is an optional fractional field representing seconds and decimal fractions of seconds. If you do not enter a value for seconds, SAS assumes a value of 0.

Example

    Data Line       SAS Statement         Result
    14:22:25        input begin time8.;   51745

The DATETIMEw. informat reads date and time values, e.g. 8:30 p.m. on May 6 1989 could be represented as 6MAY89:20:30 using DATETIME12.

Another way to specify SAS date/time values is with special constants, e.g. 18 February 1951 is represented as '18FEB51'D, high noon as '12:00'T and a moment in date and time as, e.g., '1OCT82:15:27:05'DT.

Exercises
10.1 A character variable alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. Work out what you think the following functions will return, then check your answers with a SAS program.
(a) index(alphabet, 'FGHIJ')
(b) index(alphabet, 'JNP')
(c) substr(alphabet, 3, 3)
(d) verify(alphabet, 'ABDEI')

10.2 Write a SAS program to find out how old you were on 1 October 1999 (in years).

10.3 Download the SAS data set napier.SAS7BDAT from the web. This contains data on first year Napier students in 1993/4. There are 1872 records and 33 variables. The variables that you need to consider are

    crsecd   Numeric variable giving course code
    dob      Date of birth

Identify the oldest student.
Which course is he/she in? _________________________
Which course has the youngest students (on average)? _______________
Which courses have the largest numbers of students? __________________________

In order to answer these questions you will need to
(a) Create a variable giving the age in years.
(b) Sort the data set by age, storing your sorted data set in a temporary data set. Examine the data set from the Explorer window to find the oldest student and their course.
(c) Sort the data set by crsecd, storing your sorted data set in a temporary data set. Use proc means to make a new temporary data set that will contain the mean, minimum and maximum age of students on each course. Examine your data set of means to find the course with the youngest students.
(d) Sort the data set by the number of records for each course (_freq_) to find the course with the largest number of students.

11. Iterative processing

11.1. Do loops and arrays
The DO statement allows us to perform an iterative loop, e.g.

    do i = 1 to 5;
      (lines of SAS code)
    end;

would result in the lines of SAS code being repeated 5 times, with i taking the values 1, 2, 3, 4 and 5. Here i is used as the counter, 1 is the start value and 5 is the end value. The default increment is 1. We can specify the increment, e.g.

    do i = 1 to 7 by 2;
      (lines of SAS code)
    end;

would result in the lines of code between the DO and END statements being repeated 4 times, with i taking the values 1, 3, 5 and 7.
_____________________________________________________________
11.2. Reading data in repeated patterns
The quality control department takes 4 sample cans of oil from a production line and weighs them, every hour for 12 hours. Each record in the raw data contains the following fields:

    hour         the hour in which the samples were taken
    weight 1-4   weights of the four sample cans

The quality control department wants to analyse these data.
The first step is to create a SAS data set so that it contains a single observation for each measurement taken. The DATA step must create four observations from each record, i.e.

first record of raw data:

    1 8.024 8.135 8.151 8.065

first four observations in the data set:

    HOUR   WEIGHT
    1      8.024
    1      8.135
    1      8.151
    1      8.065

The first INPUT statement reads a value from the first field and assigns it to HOUR. The value for HOUR is the same for all four observations to be created from the first record.

    data oil1;
      input hour @;

The single trailing @ sign is used to hold the current record, preventing the next INPUT statement from reading a new record. The next step is to read each value for WEIGHT (four in each record) and write an observation after each is read. An iterative DO loop enables us to write a single pair of INPUT and OUTPUT statements to read a value and write an observation multiple times.

    data oil1 (drop=i);
      input hour @;
      do i = 1 to 4;
        input weight @;
        output;
      end;
    cards;
    1 8.024 8.135 8.151 8.065
    2 7.971 8.165 8.166 8.157
    3 8.024 8.135 8.151 8.065
    etc.
    12 7.971 8.165 8.166 8.157
    ;
    proc print;
    run;

The results would be

    OBS   HOUR   WEIGHT
    1     1      8.024
    2     1      8.135
    3     1      8.151
    4     1      8.065
    5     2      7.971
    6     2      8.165
    7     2      8.166
    etc.
_____________________________________________________________
11.3. Arrays
Arrays in SAS are used as a shorthand way of processing many variables with a few statements. An array is an ordered list of variable names. It is often used along with a DO statement to carry out an action repeatedly on a sequence of variables.
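A minimal sketch of an explicit array, using five numeric variables a, b, c, d and e, is given here for illustration. The array name item, the data set name example and the data values are assumptions chosen for this sketch and do not come from the course files.

```sas
data example;
  input a b c d e;
  /* item is an (assumed) array name; {5} is the number of elements. */
  array item{5} a b c d e;
  /* Double every element: item{1} is a, item{2} is b, and so on. */
  do i = 1 to 5;
    item{i} = item{i} * 2;
  end;
  drop i;
cards;
1 2 3 4 5
;
run;

proc print data=example;
run;
```

The array exists only while the DATA step runs; the data set example contains just the variables a to e, now holding 2, 4, 6, 8 and 10.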
When defining an explicit array, the ARRAY statement must contain

    - an array name
    - a subscript that indicates the number of elements in the array
    - a dollar sign if the array is of character variables
    - a description of the elements (for example, the variables a, b, c, d and e)

If we do not know the number of elements in an array we can use an asterisk to define the array, e.g.

    array scores{*} score1 score2 score5 score8;

(the array name scores is only an example), although this is a lot less efficient in processing time than specifying the actual array dimensions.

Multidimensional arrays can be specified similarly, e.g.

    array x{3,5} test1 - test15;

could specify an array where the first dimension is the class number and the second is the test number (i.e. 5 tests for each of 3 classes). The elements of this array are referred to by (for example) x{2,3}, which gives the element in the second row and third column of the array.
_____________________________________________________________

Example

The following program recodes missing scores in a test to 0

    data results;
       infile class6;
       input id age score1-score5;
       if score1 = . then score1 = 0;
       if score2 = . then score2 = 0;
       if score3 = . then score3 = 0;
       if score4 = . then score4 = 0;
       if score5 = . then score5 = 0;
    run;

This can be rewritten using arrays as follows

    data results;
       infile class6;
       input id age score1 - score5;
       array ss(5) score1-score5;
       do i = 1 to 5;
          if ss(i) = . then ss(i) = 0;
       end;
       drop i;
    run;

The reduction in code is not very much with 5 scores, but if we had 150 scores arrays would be much more efficient.
_____________________________________________________________

11.4. Generating random numbers

In the data step examples we have examined so far, the SAS system has read a single record during the data step and written it to the output data set at the end of the data step.
The SAS instruction to write a record to a data set is

    OUTPUT;

and it can be combined with a KEEP statement,

    KEEP vars;

where vars stands for the variables to be saved. If you write a SAS data step that does not contain an OUTPUT statement, then SAS will assume that you want a record output at the end of the data step. If your program does contain an OUTPUT statement, then SAS will write a record at this point in the program and not at the end of the data step.

The OUTPUT statement can be used to write a program that will produce a series of 100 records, each containing a different uniform random number, with this code. The seed 2762 can be replaced with any number you like.

    data random;
       do i = 1 to 100;
          x = ranuni(2762);
          output;
       end;
       keep x;
    run;

The keep statement is optional; if it is not included all variables will be saved to the file.

In generating random numbers, the ones at the beginning of a sequence can sometimes not be very random for certain choices of seed (check this from your own sequence), but they are generally OK after 500 numbers or so. To make sure that your sequence will be OK, use the following code near the start of any DATA step that generates random numbers.

    do i = 1 to 500;
       x = ranuni(1279);   * or any number as a seed;
    end;

This will give the random number generator a whirl to ensure it is running smoothly.
_____________________________________________________________

11.5. Random numbers from a uniform distribution

In order to be able to select a random sample or randomly assign subjects to groups we can use SAS functions to generate random numbers. The two functions UNIFORM(0) and RANUNI(0) generate uniform random numbers in the range from 0 to 1. Random number generators require an initial number, called a seed, which they use to calculate the first random number. This number is then used to generate the next and so on.
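A minimal sketch of what the seed does, using only the RANUNI function described here. The data set names run1, run2 and both, and the seed 12345, are arbitrary choices made for this illustration. Two separate DATA steps draw five numbers each from the same positive seed, and the results are then placed side by side for comparison.

```sas
data run1;
   do i = 1 to 5;
      x = ranuni(12345);    /* fixed positive seed */
      output;
   end;
run;

data run2;
   do i = 1 to 5;
      y = ranuni(12345);    /* same seed, separate DATA step */
      output;
   end;
run;

data both;
   merge run1 run2;         /* both are already in order of i */
   by i;
   same = (x = y);          /* 1 when the two draws agree, 0 otherwise */
run;

proc print data=both;
run;
```

With the same positive seed in both steps the x and y columns are expected to agree. The comparison has to be made across two DATA steps because, within a single step, repeated calls to RANUNI continue one stream rather than restarting it.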
For both of these functions, a seed of zero will cause the function to use a seed derived from the time clock, thus generating a different series each time it is used. The RANUNI function can be seeded with any number, but the UNIFORM function must be seeded with a 5, 6 or 9 digit odd number. In either case, if you supply the seed, the function will generate the same series of random numbers each time.

In order to generate a series of random numbers from 1 to 100, we could use

    x = 1 + 99*ranuni(0);
_____________________________________________________________

11.6. Random numbers from a normal distribution

To generate random numbers from a normal distribution with a mean of 0 and a standard deviation of 1 we can use the RANNOR function, which works in a similar way to RANUNI.
_____________________________________________________________

11.7. The SAS Program Data Vector

While at work on a DATA step, the SAS System maintains a temporary data structure in computer memory called the SAS program data vector, or PDV. The program data vector represents one observation, a data row, and can be thought of as a linear set of boxes in which values of SAS variables can be contained. Unlike a SAS data set, which survives between steps, the PDV is a dynamic entity that is created during a DATA step and goes away after the step has completed execution.

Constructing the PDV is one of the DATA step compiler’s first jobs as it passes through the DATA step source code. The compiler looks at all the SAS statements in the step’s source to find out what variables are named (in INPUT, attribute, or other statements) and creates a PDV with space for each variable’s length.

There are a couple of automatic system variables also in the program data vector: the variables _N_ and _ERROR_. These are maintained by the compiler and may be accessed by the program, though they do not get written to the new data set(s).
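As a rough sketch of how a program can read these two automatic variables, the step below writes them to the log on every iteration. The data set name pdvdemo, the variable x and the three data lines are made up for this illustration; the second data line is deliberately not a number.

```sas
data pdvdemo;
   input x;
   put _n_= _error_= x=;    /* write the automatic variables to the log */
datalines;
10
abc
30
;
run;

proc print data=pdvdemo;    /* _N_ and _ERROR_ do not appear here */
run;
```

The log should show _N_ counting the iterations, and the invalid value abc should leave x missing and switch _ERROR_ to 1 for that iteration only. The printed data set contains x alone.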
_N_ contains a count of how many times the DATA step has begun execution from the top (i.e. the number of records), and _ERROR_ is set to 1 (true) when a data error occurs.

When a DATA step is executing, each time it begins another iteration the values of variables to be created by INPUT or by assignment in the PDV are initialised to missing (unless a RETAIN or Sum statement has been used), _N_ is incremented, and _ERROR_ is set to zero. When the DATA step returns to the top for the next iteration, the PDV is reinitialised and the process repeats.
_____________________________________________________________

11.8. The RETAIN and Sum statements

Normally, variables in the PDV that are named with assignment or with INPUT statements are initialised to missing each time the DATA step begins a new iteration. The RETAIN statement,

    RETAIN [<variables [value]> ....];

causes variables to keep their values from the previous iteration at initialisation. They can still be changed if INPUT reads a new observation, or when an assignment statement (including the Sum statement) is executed. The RETAIN statement can only be applied to ‘new’ variables, that is, ones which are being created within the DATA step. If a constant value is specified after the variable list, that value is given at the first iteration; otherwise, retained variables start with a missing value (a variable named in a Sum statement starts at zero).

Example

Given the SAS data set history:

    Year   Month   No_sold
    1990       1        10
    1990       2        12
    1990       3         8
    1990       4         6
    1990       5         9

the program:

    data changes;
       retain no_last;
       set course.history;
       compare = no_sold - no_last;
       no_last = no_sold;
    run;

produces the PDV

    _N_   _ERROR_   YEAR   MONTH   NO_SOLD   COMPARE   NO_LAST
      1         0   1990       1        10         .        10
      2         0   1990       2        12         2        12
      3         0   1990       3         8        -4         8
      4         0   1990       4         6        -2         6
      5         0   1990       5         9         3         9

The Sum statement is a special type of assignment statement, provided as a convenience for incrementing variables during the DATA step.
The statement

    a + 7;

is identical in action to the statements

    retain a 0;
    a = sum(a, 7);

i.e. a starts at zero and 7 is added to the previous value of a on each iteration.

Exercises

11.1 Results of a survey are recorded for 1996 and 1997. However an extra question was asked in 1997. Create a SAS data set from the following data showing year and the answers to the questions.

    1996 4 8 3 5 6 5
    1996 5 7 4 5 6 8
    1997 3 5 4 7 4 5 6
    1997 5 3 4 3 6 7 8

11.2 Rewrite the following program using arrays:

    data test;
       input a b c x1-x3 y1-y3;
       if a = 999 then a = .;
       if b = 999 then b = .;
       if c = 999 then c = .;
       if x1 = 999 then x1 = .;
       if x2 = 999 then x2 = .;
       if x3 = 999 then x3 = .;
       if y1 = 999 then y1 = .;
       if y2 = 999 then y2 = .;
       if y3 = 999 then y3 = .;
    datalines;
    3 5 2 7 5 999 2 5 999
    2 9 4 7 2 4 999 4 999
    3 8 3 0 3 2 999 7 1
    run;

11.3 The chi-square distribution with k degrees of freedom is the distribution of the sum of the squares of k independent standard normal random variables. Generate and plot the frequency distributions of chi-squared variables with 5, 10 and 30 degrees of freedom.

12. Further topics

12.1. Combining Data Sets

12.1.1. Concatenating Data Sets

To concatenate data sets means to combine similar data sets into a single new data set. In its simplest form the original data sets contain the same variables and the combined data set will contain the original data sets ‘one on top of the other’.

Suppose SAS data set wood08 contains values for the five variables id (alpha-numeric), age, diameter, height and variety (alpha-numeric) obtained from a survey of trees in a wood having reference number 08. Suppose further that similar results are available in SAS data sets wood48 and wood69. The following code will combine the three data sets into a new SAS data set woodcom (in the order 08, followed by 48, followed by 69).
    data woodcom;
       set wood08 wood48 wood69;
    run;

If the original data sets contain different variables then, in the combined data set, observations that come from a data set lacking a variable will have missing values for that variable.

12.1.2. Merging Data Sets

We merge data sets when we combine data sets containing different information. In its simplest form (match-merging) we combine the data sets on the basis of a common variable which typically identifies each case or row.

Consider again SAS data set wood08 containing values for the five variables id (alpha-numeric), age, diameter, height and variety (alpha-numeric). Suppose a related SAS data set woodrs08 contains values for the following variables: id (alpha-numeric) plus numerical scores for damage, bark_dep and condit. We will merge the two data sets into a combined data set having variables id, age, diameter, height, variety, damage, bark_dep and condit. Matching will be by the variable id (termed the ‘BY variable’). However, we must first sort both data sets by the BY variable. This is illustrated in the following code.

    * Sort the first data set by id ;
    proc sort data=wood08;
       by id;
    run;

    * Sort the second data set by id ;
    proc sort data=woodrs08;
       by id;
    run;

    * Merge the two data sets using id as a key ;
    data woodall08;
       merge wood08 woodrs08;
       by id;
    run;

The MERGE operation will combine cases having the same values for id in data sets wood08 and woodrs08. If there are values of id that are not present in both data sets then missing values will be inserted for the relevant variables.

Exercises

12.1. The ASCII data set mod273ft.txt contains results from a module taught to full-time students with values for student name (alphanumeric), CW1, CW2, CW3, combined coursework and exam. Data set mod273pt.txt contains the same information for part-time students. The data sets are available on the module web page and should be examined with Notepad or WordPad.
(a) Write a program which reads in the two ASCII data sets and combines them into a single permanent SAS data set with suitable variable names. Confirm that the data set has been constructed correctly.
(b) Modify your program so that the combined data set contains a new variable indicating which group (full-time or part-time) each student belongs to.

12.2. The ASCII data set mod273ptss.txt gives background information on the statistical software that is used at work by the part-time students. Values are given for name, excel, sas, spss (all alpha-numeric, software results are Y or N). Write a program which reads in the two data sets mod273pt.txt and mod273ptss.txt and merges them into a single permanent SAS data set. Confirm that the data set has been constructed correctly.

12.2. Hints on Using Word with SAS and SAS/INSIGHT

12.2.1. Copying a Selection from the Output Window (WINDOWS)

In SAS Output Window
    Highlight text
    Edit Copy
In Word document
    Edit Paste

Note: In the event of difficulty use Ctrl/C for Copy and Ctrl/V for Paste.

12.2.2. Saving the Whole Output Window

In SAS Output Window
    File Save As
    Choose directory and file name

The automatic file extension is .lst. Word and other text editors have no difficulty opening or inserting a list file.

12.2.3. Choice of Font within Word Document

For tables of figures you are recommended to use a monospace font, i.e. one that has a constant width for all characters. Arial and Times New Roman are not monospace fonts. Examples are

    SAS Monospace (12 point)   1 2 3 4 5   (You may need SAS running to get this font.)
    SAS Monospace (10 point)   1 2 3 4 5
    Courier New (10 point)     1 2 3 4 5
    Courier New (12 point)     1 2 3 4 5

To avoid tables ‘wrapping round’ you could remove unnecessary spaces to the left or reduce the size of the font.

It can be effective to use different fonts for different parts of the document.
For example you might use a standard font like Arial or Times New Roman for text and a monospace font for tables. Monospace fonts might also be used for file names etc.

12.2.4. Copying from a Graphics Window (WINDOWS)

In SAS Output Window
    Edit Copy
In Word document
    Edit Paste Special
    Choose Device Independent Bitmap
To reduce the file size
    Edit Paste
    Followed by Edit Cut; Edit Paste Special
    Choose png or jpeg

When you try to reposition the picture you may find that the picture jumps around the document in an uncontrollable manner. This can be eliminated by adding dummy returns that will lie under the picture. Next, choose the ‘In front of text’ wrapping style from the Layout tab of the picture format dialogue box. (Right click on the picture to select the format option.) The dummy returns must be held together using paragraph formatting and the picture anchor locked onto the dummy returns. This is shown in the diagram below.

[Figure 14: Picture control using ‘In front of text’ wrapping style. Labels in the diagram: Picture; Picture anchor locked (Advanced setting); Enter returns, select all, then Format Paragraph, Keep lines together + Keep with next; Format Picture dialogue.]

12.2.5. Copying and Pasting from SAS/INSIGHT

Tables

Tables can be saved in graphics form or as text (recommended).

In SAS/INSIGHT Analysis Window
    Click on ‘arrow’ at top of table
    Choose Save

The table (as text) is put into the base SAS Output Window. Copy from the Output Window as described above. See below for saving a table in graphics form.

Graphs (WINDOWS)

In SAS/INSIGHT Analysis Window
    Click on border of graph (or table)
    Edit Copy
In Word document
    Edit Paste Special
    Choose Device Independent Bitmap

Note: If you have highlighted points on your graph hold down Ctrl when you click on the border.

12.2.6. Left Alignment of SAS Output

Copying and pasting from the SAS Output Window is easier if the output is already aligned to the left.
The following option will ensure that all future output is left aligned:

    options nocentre;
    run;

13. Solutions to exercises

1.1

    data ELEVEN;
       input ID $ height weight;
       ratio=weight/height;
    datalines;
    59 135 25
    82 146 33
    27 153 56
    52 154 51
    55 139 31
    13 131 25
    01 149 43
    15 137 32
    71 133 30
    78 149 35
    12 141 33
    37 164 48
    28 146 37
    48 149 45
    69 147 36
    16 152 47
    run;
    /* Data came from Exercise 1.1*/
    proc means;
    run;

1.2

    ...
    bmi = 100*weight/height;
    ...

1.3

    data FORTH;
       input site $ salinity phos nitrogen chloro faecal_c;
       phos2=1000*phos;
    datalines;
    CR 30.11 0.068 0.297 1.693 2.917
    WG 31.48 0.059 0.165 1.464 3.149
    EG 31.79 0.068 0.144 1.100 3.196
    SF 31.37 0.185 0.278 1.787 3.418
    PB 31.50 0.116 0.223 2.099 3.049
    JO 31.60 0.106 0.207 1.067 2.903
    SS 30.50 0.047 0.162 1.563 2.895
    FN 31.96 0.060 0.130 0.753 2.797
    run;
    proc means;
    run;

2.1

    libname week2 'c:\sas\sasdata';
    data week2.eleven;
       input ID $ height weight;
       ratio=weight/height;
       /*Program must be cut and pasted from word document*/
    datalines;
    59 135 25
    82 146 33
    27 153 56
    52 154 51
    55 139 31
    13 131 25
    01 149 43
    15 137 32
    71 133 30
    78 149 35
    12 141 33
    37 164 48
    28 146 37
    48 149 45
    69 147 36
    16 152 47
    run;

2.2 (b)

    libname week2 'c:\sas\sasdata';
    data week2.pulse;
       infile 'c:\sas\sasdata\pulse.dat';
       input pulse1 pulse2 ran $ smokes $ sex $ height weight activity $;
    proc print;
    run;

2.4 (b)

    libname mydir 'c:\temp';
    data mydir.ex2_4;
       infile 'c:\temp\ex2_4.prn';
       input size $ colour $ price cost;
    run;
    proc print data = mydir.ex2_4;
       var colour size price;
    run;

2.4 (c)

    libname mydir 'c:\temp';
    data mydir.ex2_4;
       infile 'c:\temp\ex2_4.prn';
       input size $ 1-8 colour $ 9-19 price 20-24 cost 25-32;
    run;
    proc print data = mydir.ex2_4;
       var colour size price;
    run;

2.4 (d)

    libname mydir 'c:\temp';
    data mydir.ex2_4;
       infile 'c:\temp\ex2_4.prn';
       input @1 size $8. @9 colour $11.
             @20 price 5.2 @29 cost 4.2;
    run;
    proc print data = mydir.ex2_4;
       var colour size price;
    run;

2.5

    libname mydir 'c:\temp';
    data mydir.houses;
       infile 'c:\temp\houses.dat';
       input style 1 sqfeet 3-6 bedroom 8 baths 10-12 price 14-19;
    run;
    proc print;
    run;

2.6

    libname mscsas 'c:\kirsty\mscsas';
    data cars;
       infile 'd:\kirsty\temp\cars.prn' firstobs = 2;
       input mpg 1-8 cylndrs 9-16 displace 17-24 hrsepwr 25-33 accel 34-41
             year 42-49 weight 50-57 origin 58-65 make $ 66-75 model $ 76-89
             price 90-93;
    run;

4.1 (a)

    options nocentre;
    libname unit4 'c:\sas\sasdata';
    proc sort data=unit4.pulse out=smokes;
       by smokes ran;
    proc means maxdec=1;
       var pulse2;
       by smokes ran;
    run;

(b)

    options centre;
    proc means data=smokes alpha=0.05 maxdec=1 clm;
       var pulse2;
       by smokes ran;
    run;

(c) and (d)

    proc univariate data=smokes plot;
       var pulse2;
       by smokes ran;
    proc freq data=smokes;
       tables smokes*ran;
    run;

4.2

    proc sort data = sashelp.retail out = sorted;
       by year;
    proc means maxdec=2 N mean std;
       var sales;
       by year;
       output out=summary mean=mean_sls;
    run;
    proc print data = summary;
    run;

5.1

    proc plot data=unit5.eleven;
       plot weight*height='+';
    run;

5.2

    proc plot data=unit5.pulse;
       plot pulse2*weight=sex;
    run;
    proc chart data=unit5.pulse;
       pie activity;
    run;
    proc chart data=unit5.pulse;
       hbar ran/sumvar=pulse2 type=mean;
    run;
    proc chart data=unit5.pulse;
       vbar activity/type=pct group=sex;
    run;
    proc chart data=unit5.pulse;
       block activity/group=smokes sumvar=pulse1 type=mean;
    run;

6.1

    libname unit6 'c:\sasdata';
    run;
    proc plot data=unit6.beetles;
       plot area*mlength;
    run;
    proc reg data=unit6.beetles;
       model area=mlength;
       output out=regout student=stdresid;
       plot student.*mlength;
    run;
    proc univariate data=regout plot normal;
       var stdresid;
    run;

8.1

    libname unit8 'c:\sasdata';
    run;
    data mpulse (drop=sex height weight);
       set unit8.pulse (where=(sex='1'));
       label pulse1='First pulse rate'
             pulse2='Second pulse rate';
    run;
    proc print data=mpulse label;
    run;

8.2

Note that the formats must match the type of the variables in the data set pulse, i.e. numeric variables need numeric formats. Values of numeric variables are written without quotes.

    libname library 'c:\sasdata';
    run;
    proc format library=library;
       value $ran '1'='ran in place'
                  '2'='did not run in place';
       value $smokes '1'='smokes regularly'
                     '2'='does not smoke regularly';
       value $sex '1'='male'
                  '2'='female';
       value $activity '1'='slight'
                       '2'='moderate'
                       '3'='a lot';
    run;
    proc freq data=unit8.pulse;
       tables sex*smokes smokes*activity sex*ran/nocol norow nocum nopercent;
       format sex $sex. ran $ran. smokes $smokes. activity $activity.;
    run;

8.3

    proc format library=library;
       value pulsrate low-76=1
                      77-high=2;
    run;
    proc freq data=unit8.pulse;
       tables pulse1*pulse2/nocol norow nocum nopercent chisq;
       by ran;
       format pulse1 pulse2 pulsrate.;
    run;

9.1 (a)

    proc format;
       value sex 1 = 'male'
                 2 = 'female';
       value activity 1 = 'slight'
                      2 = 'moderate'
                      3 = 'a lot';
    run;
    proc tabulate data = mydir.pulse;
       class sex activity;
       var pulse1;
       table activity,sex * pulse1*mean;
       format sex sex. activity activity.;
       title 'Average pulse without exercise';
    run;

9.1 (b)

    proc format;
       value smokes 1 = 'smokes regularly'
                    2 = 'does not smoke regularly';
       value ran 1 = 'ran in place'
                 2 = 'did not run in place';
    run;
    data mydir.pulsev2;
       set mydir.pulse;
       pulsedif = pulse2 - pulse1;
    run;
    proc tabulate data = mydir.pulsev2;
       class smokes ran;
       var pulsedif;
       table smokes, ran * pulsedif*mean;
       format smokes smokes.
              ran ran.;
       title 'Average difference in pulse by whether they smoke and/or ran';
    run;

10.1

    data test;
       input alphabet $26.;
       pt1 = index(alphabet,'FGHIJ');
       pt2 = index(alphabet,'JNP');
       pt3 = substr(alphabet,3,3);
       pt4 = verify(alphabet,'ABDE');
    datalines;
    ABCDEFGHIJKLMNOPQRSTUVWXYZ
    run;
    proc print;
    run;

10.2

    data ageatoct;
       input dob YYMMDD6.;
       date = '01OCT99'D;
       ageatoct = (date-dob)/365;
    datalines;
    740625
    run;

10.3

    libname mydisc 'd:\kirsty\temp';

    * first set up a variable for age;
    data mydisc.napier2;
       set mydisc.napier;
       enddate = '31DEC93'D;   * sets up a variable as a date constant;
       age = (enddate - dob)/365.25;
    run;

    * create a temporary SAS data set called agesort, sorted by age
      and examine data set to identify oldest student;
    proc sort data = mydisc.napier2 out = agesort;
       by age;
    run;

    * create a temporary SAS data set called napsort, sorted by course;
    proc sort data = mydisc.napier2 out = napsort;
       by crsecd;
    run;

    * create a temporary SAS data set called means containing the mean,
      max and min ages by course. Examine the data set to find which course
      has the youngest students (on average) Anything strange?!
    ;
    proc means data = napsort;
       var age;
       by crsecd;
       output out = means mean = mnage max = maxage min = minage;
    run;

    * sort by number of records for each course and examine data set
      to find the course with the largest number of students;
    proc sort data = means;
       by _freq_;
    run;

11.1

    data survey;
       input year @;   * hold the line;
       if year = 1996 then input q1-q6;
       else if year = 1997 then input q1-q7;
    datalines;
    1996 4 8 3 5 6 5
    1996 5 7 4 5 6 8
    1997 3 5 4 7 4 5 6
    1997 5 3 4 3 6 7 8
    ;
    run;
    proc print;
    run;

11.2

    data test;
       input a b c x1-x3 y1-y3;
       array tt (9) a b c x1-x3 y1-y3;
       do i = 1 to 9;
          if tt(i) = 999 then tt(i) = .;
       end;
       drop i;
    datalines;
    3 5 2 7 5 999 2 5 999
    2 9 4 7 2 4 999 4 999
    3 8 3 0 3 2 999 7 1
    run;
    proc print;
    run;

11.3

    data chisq;
       n = 5;   * degrees of freedom required (one squared normal per degree);
       do i = 1 to 500;
          x = rannor(3059);
       end;     * to ensure numbers are random;
       do j = 1 to 500;   * generate 500 chi-squared values;
          chi = 0;
          do i = 1 to n;
             x = rannor(3059);
             chi = chi + x*x;
             retain chi;
          end;
          output;
          keep chi;
       end;
    run;
    proc gchart;
       vbar chi;
    run;

12.1. (b)

    /* concatmod.sas */
    libname xyz 'c:\Documents and Settings\mp12\My Documents\sasfiles\sasdata8';
    data modft;
       infile 'c:\Documents and Settings\mp12\My Documents\data\mod273ft.txt';
       input name $ 1-16 cw1 cw2 cw3 com_cw exam;
       group = 'FT';
    run;
    data modpt;
       infile 'c:\Documents and Settings\mp12\My Documents\data\mod273pt.txt';
       input name $ 1-16 cw1 cw2 cw3 com_cw exam;
       group = 'PT';
    run;
    data xyz.mod273com;
       set modft modpt;
    run;
    proc print data=xyz.mod273com;
    run;

12.2.
    /* mergemod.sas */
    libname abc 'c:\Documents and Settings\mp12\My Documents\sasfiles\sasdata8';
    data modpt;
       infile 'c:\Documents and Settings\mp12\My Documents\data\mod273pt.txt';
       input name $ 1-16 cw1 cw2 cw3 com_cw exam;
    run;
    proc sort data=modpt;
       by name;
    run;
    data modptss;
       infile 'c:\Documents and Settings\mp12\My Documents\data\mod273ptss.txt';
       input name $ 1-16 excel $ sas $ spss $;
    run;
    proc sort data=modptss;
       by name;
    run;
    data abc.modptcom;
       merge modpt modptss;
       by name;
    run;
    proc print data=abc.modptcom;
    run;

    proc univariate data = pulse;
       var before;
       histogram before / midpoints = 0 to 200 by 10;
       title 'Histogram Pulse Data';
    run;